Interesting. The push and pop stuff reminds me of how Pygments does its syntax highlighting. I guess it's not an uncommon approach. Looking forward to playing with it.
TextMate 2 has a nice feature that I'd like to see in Sublime Text: being able to generate the scope name based on what you're matching. Here is a cool example from the Markdown syntax in TextMate:
heading = {
name = 'markup.heading.${1/(#)(#)?(#)?(#)?(#)?(#)?/${6:?6:${5:?5:${4:?4:${3:?3:${2:?2:1}}}}}/}.markdown';
begin = '(?:^|\G)(#{1,6})\s*(?=[\S[^#]])';
end = '\s*(#{1,6})?$\n?';
captures = { 1 = { name = 'punctuation.definition.heading.markdown'; }; };
contentName = 'entity.name.section.markdown';
patterns = ( { include = '#inline'; } );
}
It adds scope names like markup.heading.2.markdown and markup.heading.3.markdown depending on the heading level of the Markdown. I've recreated these scopes by duplicating the match definition 6 times.
It would be awesome to have some basic string replacement capabilities in the scope name, based on the match regex.
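For comparison, here is a rough sketch of what the current duplication workaround could look like in the new format; the regexes are simplified and only two of the six levels are shown:

contexts:
  main:
    # without substitution in scope names, one match per heading level
    - match: '^(#{6})\s*(?=\S)'
      scope: markup.heading.6.markdown
      captures:
        1: punctuation.definition.heading.markdown
    - match: '^(#{5})\s*(?=\S)'
      scope: markup.heading.5.markdown
      captures:
        1: punctuation.definition.heading.markdown
    # ... and so on, down to markup.heading.1.markdown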
Yes, btw: there will be an automatic converter from .tmLanguage to .sublime-syntax files.[/quote]
Excellent. Since the new system looks much nicer to work with, I hope this will stimulate community involvement in creating new language definitions and fixing/updating existing ones.
To return to an earlier point, I'm wondering just how much room there is for improvement in Sublime's syntax parser and what benefits such improvements could bring. I referred previously to expanding APIs in this area to facilitate the development of language-aware plugins like code refactoring, but it also struck me that more precise and detailed scopes could allow for the implementation of useful editor functions, for example "goto reference" (the inverse of goto definition), in a lightweight manner. Thoughts?
Unfortunately not. It can only recognise LL(1) grammars, so the contents of the following lines can't influence the prior lines.[/quote]
Do you know if an LL(1) parser is sufficient for syntax-highlighting C++? I know that C++ can't be lexed with anything less than a Turing machine, but I don't know if syntax highlighting is as hard as lexing.
Great work so far! I absolutely love the new syntax file spec. I've been terrified of writing syntax files till now, but I think I will give it another go once this is done. There are a lot of issues with C++ syntax highlighting that I want to fix.
This is somewhat true, sadly. Once you understand the rules behind it, it becomes rather simple, and when assisted by a linter (such as sublimelinter-pyyaml) you can probably catch most of the bad unquoted strings easily before you run into issues. It's not as straightforward as JSON, however.
Anyway, I love YAML (which is why .YAML-tmLanguage exists), so I embrace this change.
By the way, SyntaxHighlightTools goes in a different direction and defines a custom YAML-like syntax, but it's not really suited for the context model imo.
It's nice that it bundles the tmPreferences stuff, however, which "works" but is still pretty clunky. It's afaik the only extension file type, besides tmTheme, that is still only available using ye old TextMate-compatible plists. Any plans for working on that in the future?
That sounds great! I would also like to know how much using atomic groups for the popular keyword matches would improve performance compared to normal grouped matches, because I never see them used anywhere, besides maybe in my own syntaxes.
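For anyone unfamiliar with the terminology, here is a minimal sketch of the two variants being compared; the keyword list and scope names are made up, and whether the atomic form is actually measurably faster is exactly the open question:

contexts:
  main:
    # ordinary non-capturing group
    - match: '\b(?:if|else|while|for)\b'
      scope: keyword.control.example
    # atomic group: once an alternative has matched, the engine
    # will not backtrack into the group again
    - match: '\b(?>if|else|while|for)\b'
      scope: keyword.control.example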
find_by_selector or extract_scope are really useful in that regard, but sometimes slightly better syntax crawling would be useful; something similar to what jQuery does, maybe, with parent, child or sibling queries. This should all be doable with the current APIs and some neat algorithms, however, except for the weird case that facelessuser described. If I ever needed something more complex I'd likely write some of these algorithms myself, but it would be interesting to know how taxing these two API calls are (for large files) and whether it makes sense to implement more performant variants in ST itself.
Now to my own feedback regarding the new syntax format, which is a bit more technical.
The possibility to match multi-line constructs with better context control is great.
The possibility to more or less "inject" matches into included contexts or other languages via the prototype stuff is also great.
Have you thought of including/pushing a specific context of a different syntax definition? Such as push: "Packages/HTML/HTML.sublime-syntax:tag", or using another character that is usually invalid in file paths. That would allow more flexible and organised syntax management and not require defining many hidden syntax definitions.
Would it be feasible to allow some kind of reflexive context awareness for these with_prototype-injected patterns? E.g. some way to only apply the pattern if "string" matches the entire current scope, or only the deepest scope name. Or maybe just make it apply only in certain contexts.
You have an example where a context match accesses the matched groups of the match that popped it, with a \1 backref (heredocs). I suppose this only works with the parent context's match, but this assumes that the parent context has a match in the first place, correct?
What about the normal Oniguruma syntax of \n (where n is any number to back-reference a previous group match)?
I know that this was also possible previously with begin-end-patterns but I never tested how it worked. Are matches in the end pattern simply appended matches of the begin pattern?
What will patterns backref if a context was changed using the set key?
Are named capture groups an option instead of always using numbers? It would surely be cleaner to have named capture groups and assign scope names to those groups by their name instead of by number sometimes. Oniguruma seems to support this according to the spec, but it's not enabled for syntax definitions and I don't know if the options are exclusive.
What happens when both YAML.tmLanguage and YAML.sublime-syntax are present?
The current YAML syntax def is still not exactly correct, but it's better than the previous one. I'll wait until you've settled the "bringing the syntax definitions up to date" part (which hopefully results in some form of open sourcing for contributions) before fixing it, however.
At this stage, no. YAML is a great format if you're familiar with it, but the syntax can be quite unintuitive at times (e.g., exactly when a string needs to be quoted is not always obvious). I'm not ruling it out, it's just something I'm approaching with caution.
This is somewhat true, sadly. Once you understand the rules behind it, it becomes rather simple, and when assisted by a linter (such as sublimelinter-pyyaml) you can probably catch most of the bad unquoted strings easily before you run into issues. It's not as straightforward as JSON, however.
Anyway, I love YAML (which is why .YAML-tmLanguage exists), so I embrace this change.
[/quote]
Now that I've spent more time working with YAML, I'm still not sure how I feel about it. I don't like the way it effectively forces 2-space indenting, and I don't like that there's just so much stuff in the spec, but once you know what's going on it is nice to write. It's easy enough to justify for things like syntax definitions, but not so much for everything else. Who would ever guess that double quotes and single quotes have different escaping rules, for example?
It's not mentioned in the documentation, but you can use contexts in other .sublime-syntax files in the same way as local ones (e.g., push them or include them). They can be referenced via "Packages/Foo/Foo.sublime-syntax", "Packages/Foo/Foo.sublime-syntax#main" (#main is implied if not present), or "Packages/Foo/Foo.sublime-syntax#strings".
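To illustrate, a minimal sketch of referencing external contexts; the Foo syntax, its strings context and the scope names are placeholders rather than anything that actually ships:

contexts:
  main:
    # include the main context of another syntax (#main is implied)
    - include: Packages/Foo/Foo.sublime-syntax
    # push a specific named context from another syntax
    - match: '"'
      scope: punctuation.definition.string.begin.example
      push: Packages/Foo/Foo.sublime-syntax#strings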
Currently the named context, and the transitive closure of the contexts it references, are cloned and rewritten with the patterns mentioned in with_prototype. I'm open to extending this so that some contexts won't be rewritten, but a use case would be nice.
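For reference, a minimal sketch of the kind of with_prototype injection being described; the embedded syntax path and scope name are placeholders:

contexts:
  main:
    - match: '<script>'
      scope: punctuation.definition.tag.begin.example
      push: Packages/Foo/Foo.sublime-syntax
      # this pattern is prepended to the pushed context and everything
      # it references, so the embedded syntax can always be exited
      with_prototype:
        - match: '(?=</script>)'
          pop: true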
[quote="FichteFoll"]You have an example where a context match accesses the matched groups of the match that popped it, with a \1 backref (heredocs). I suppose this only works with the parent context's match, but this assumes that the parent context has a match in the first place, correct?
What about the normal Oniguruma syntax of \n (where n is any number to back-reference a previous group match)?
I know that this was also possible previously with begin-end-patterns but I never tested how it worked. Are matches in the end pattern simply appended matches of the begin pattern?
What will patterns backref if a context was changed using the set key?
[/quote]
If the pop pattern (this processing is only done to the first one, btw) has any backrefs, then when the context is pushed onto the stack, the pop pattern will be rewritten with the values that the push or set match captured. If the regex used to push the context doesn't define a corresponding capture, then the empty string is substituted.
This also has the implication that pop regexes are unable to use backrefs in the normal way, as they'll be rewritten before the regex engine has a chance to see them. You can work around this with named captures though, which are ignored by the pop regex rewriting logic.
With regards to set vs push, they operate in exactly the same way, set just pops the current context and then pushes the given context(s) on the stack.
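To make the rewriting concrete, a minimal heredoc-style sketch consistent with the explanation above; the regexes and scope names are illustrative, not taken from a shipped definition:

contexts:
  main:
    # the push match captures the heredoc tag, e.g. EOF
    - match: '<<(\w+)'
      scope: keyword.operator.heredoc.example
      push: heredoc-body
  heredoc-body:
    - meta_scope: string.unquoted.heredoc.example
    # \1 is rewritten with the value captured by the push match, so after
    # pushing <<EOF this pop pattern effectively becomes ^EOF$
    - match: '^\1$'
      pop: true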
In principle they could be, I just havenāt done the work to enable them.
No special handling, they'll just appear in the menu as two separate syntax definitions. My current recommendation is to leave both, but mark the tmLanguage as hidden.
That doesnāt surprise me, YAML is not easy to get right!
Just a heads up, there may be some changes to the regex flavour used by sublime-syntax files in the next build, so you may not want to spend too much time playing with them in the current form.
I just discovered the conversion code and inspected it a bit. The reason I looked into it in the first place: it doesn't work for me.
Traceback (most recent call last):
  File "C:\Program Files\Sublime Text 3\sublime_plugin.py", line 535, in run_
    return self.run()
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 415, in run
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 344, in convert
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 205, in make_context
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 264, in make_context
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 127, in format_external_syntax
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 113, in syntax_for_scope
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 103, in build_scope_map
  File "./plistlib.py", line 104, in readPlistFromBytes
  File "./plistlib.py", line 76, in readPlist
  File "./plistlib.py", line 378, in parse
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 233, column 24
So, it appears that I have an invalid plist file in my packages somewhere. This needs to be caught and the conversion must continue. Furthermore, it would be nice if the name of the badly formatted file was displayed.
The format_regex function strips all leading whitespace from regular expressions and destroys indentation. I suggest using a function that detects the base indentation of a string (while skipping the first line, since that will most likely start with (?x)) and then strips only that indent instead of everything. Something like:
import textwrap

def format_regex(s):
    if "\n" in s:
        lines = s.splitlines(True)
        # dedent everything except the first line, which usually starts with (?x)
        s = lines[0] + textwrap.dedent(''.join(lines[1:]))
        s = s.rstrip("\n") + "\n"
    return s
You use plistlib for parsing. I've had problems with that before, notably that import xml.parsers.expat (which plistlib uses) failed on some Linux distros. This was all on ST2 though, but you might still want to look into it. I also had problems with plistlib sometimes reporting an "unknown encoding" for files with <?xml version="1.0" encoding="UTF-8"?>. I have no idea why that would have happened, since that's what plistlib.writePlist emits too.
ConvertSyntaxCommand is a WindowCommand, despite only really operating on a view. Is there a certain reason for this? If it's because TextCommands automatically come with an edit parameter (and create an undo point, though I'm not sure about that), then I propose adding a new "ViewCommand" that has the view object in self.view but does not automatically create an edit. The reason is that window.active_view() is not "portable" and doesn't work as expected if run with some_view.run_command("") and some_view is inactive.
import yaml should really be part of the if __name__ == "__main__": section
Edit: I discovered the $base include a while ago but never found out what it actually does. Considering it's translated to $top_level_main, I assume this is only relevant if the syntax def containing the include has itself been included from an external syntax?
By the way, I'm willing to work on an "official" Python.sublime-syntax file. I have a lot of experience with syntax definitions in general and have hacked on the PythonImproved syntax to a decent extent. Because of copyright issues I'd start from scratch, which would also result in a cleaner structure, I suppose.
Excellent work Jon, already much better than working with the tmLanguage files.
Would it be possible to ignore spaces in the regular expressions? This would help to break up regular expressions into groups to make reading and editing them easier. Currently they are just big regexes on one line. If I have missed a way to break them up visually please let me know.
Basically, it would assign to every scope expression (a dot-separated list of words) that precise expression as its scope.
Besides being a file that describes its own syntax (which is neat) and a nice example of a templating language, it would be useful when editing other syntax files, because you would be able to see how your current theme would colour the scope you are assigning, right in the syntax file itself.
By the way, LR parsers are more powerful while still using linear time. The current syntax is unable to parse the following:
(a,
b,
c) => 1
to identify the parameters (a, b, c) in JavaScript ES6, as regexes are restricted to single lines. LR is also what flex/bison use, and flex/bison is one of the most popular parser generators today.
@jps,
In sublimetext.com/docs/3/syntax.html, in the "header" section, where it says "file_line_match" it should be "first_line_match". Found out the hard way...
An update on this: I'm currently doing some work that I expect will give a significant speedup for syntax highlighting. This will primarily benefit loading large files and file indexing. I don't expect this to have any compatibility issues wrt regex flavour. Early results are promising, but there's more work to be done until it's ready. Next build won't be until next week at the earliest, and that may well turn out to be optimistic.
@jps do you anticipate ever superseding all the TextMate formats (tmTheme, tmPreferences, ...)? Not sure there's a business case for doing so, but it would be cool to have a "pure Sublime" environment.
Are there plans for a substitution syntax in regular expressions?
As I am crawling the YAML spec, I find myself needing certain regex patterns, like yaml.org/spec/1.2/spec.html#ns-uri-char, frequently. It's a pain to copy & paste these patterns all the time. It would be a lot easier if there were some syntax where I could write e.g. {{ns-uri-char}} inside a regular expression and have it substituted by the actual pattern, which I defined elsewhere. Other syntax suggestions welcome.
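A purely hypothetical sketch of what such a substitution syntax might look like; the variables section, the expansion rules and the (roughly approximated) ns-uri-char pattern are all invented for illustration:

variables:
  # rough approximation of ns-uri-char from the YAML 1.2 spec
  ns-uri-char: '%[0-9A-Fa-f]{2}|[0-9A-Za-z\-#;/?:@&=+$,_.!~*''()\[\]]'
contexts:
  main:
    - match: '(?:{{ns-uri-char}})+'
      scope: string.unquoted.uri.example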
[quote="simonzack"]By the way, LR parsers are more powerful while still using linear time. The current syntax is unable to parse the following:
(a,
b,
c) => 1
to identify the parameters (a, b, c) in JavaScript ES6, as regexes are restricted to single lines. LR is also what flex/bison use, and flex/bison is one of the most popular parser generators today.[/quote]
I created a syntax definition using the new file format which is able to detect argument lists across lines like this, for Fortran. It pushes a new context when it sees the equivalent of "(a," and only pops it when it sees the closing parenthesis. You could try something similar. My code is here:
That's the idea generally, but the reason this is an unusual case, for ES anyway, is that it's a (fortunately rare) bit of syntax where you need to look at a later token, which is permitted to fall on another line, in order to disambiguate the initial construct. Until you hit either a rest operator or the fat arrow, the syntax of the parameter group could as easily be a parenthesized expression. Since arrow functions are themselves expressions, and can appear in exactly the same places parenthesized expressions may appear, there's no outer context you can use to help.
The good news here is that in ES it's honestly pretty weird to have a line break before the arrow, and this is likewise true for the other small handful of potentially ambiguous cases, like binding patterns (for destructured assignment) vs literal objects and arrays, so you can use an Akira Tanaka Special to match braces and get it right 99.9% of the time. Here's what I used when I was attempting an ES6 sublime-syntax (since given up temporarily, because I kept running into problems with hanging Sublime that I couldn't successfully debug):
For binding patterns vs literals it's basically the same deal, except the lookahead wants to match {} or ] and see if it is followed by the vanilla assignment operator. In this case, admittedly, it's a little more likely to span lines, but I still think that's a weird enough case not to fret over. Plus, most of the time a destructured assignment will appear in a declaration, not an expression, in which case there's no need for a lookahead at all; for example the opening bracket in the following unambiguously begins an array binding pattern:
An update on where I'm at: I've built a new regex engine, replacing Oniguruma for the majority of the matching tasks within the syntax highlighter. Speed of the hybrid system is around twice that of the one in 3084, which will make large file loading and file indexing substantially more efficient. Aside from speed, there should be no user-visible changes, and the regex syntax is identical.
Plan is to tie off the loose ends and get a new build out next week, and then get back to updating the syntax definitions.
In hindsight, given the regex flavour did end up being identical, I would have put off this work until later, but I wasn't sure that was going to be the case until recently.