Interesting. The push and pop stuff reminds me of how Pygments does its syntax highlighting. I guess it's not an uncommon approach. Looking forward to playing with it.
TextMate 2 has a nice feature that I'd like to see in Sublime Text: being able to generate the scope name based on what you're matching. Here is a cool example from the Markdown syntax in TextMate:
heading = {
name = 'markup.heading.${1/(#)(#)?(#)?(#)?(#)?(#)?/${6:?6:${5:?5:${4:?4:${3:?3:${2:?2:1}}}}}/}.markdown';
begin = '(?:^|\G)(#{1,6})\s*(?=[\S[^#]])';
end = '\s*(#{1,6})?$\n?';
captures = { 1 = { name = 'punctuation.definition.heading.markdown'; }; };
contentName = 'entity.name.section.markdown';
patterns = ( { include = '#inline'; } );
}
It adds scope names like markup.heading.2.markdown and markup.heading.3.markdown depending on the heading level of the Markdown. I've recreated these scopes by duplicating the match definition 6 times.
It would be awesome to have some basic string replacement capabilities in the scope name, based on the match regex.
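For comparison, here is a rough sketch of what the current duplication workaround could look like in the new format; the regexes are simplified and only two of the six levels are shown:

contexts:
  main:
    # without substitution in scope names, one match per heading level
    - match: '^(#{6})\s*(?=\S)'
      scope: markup.heading.6.markdown
      captures:
        1: punctuation.definition.heading.markdown
    - match: '^(#{5})\s*(?=\S)'
      scope: markup.heading.5.markdown
      captures:
        1: punctuation.definition.heading.markdown
    # ... and so on, down to markup.heading.1.markdown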
Yes, btw: there will be an automatic converter from .tmLanguage to .sublime-syntax files.[/quote]
Excellent. Since the new system looks much nicer to work with, I hope this will stimulate community involvement in creating new language definitions and fixing/updating existing ones.
To return to an earlier point, I'm wondering just how much room there is for improvement in Sublime's syntax parser and what benefits such improvements could bring. I referred previously to expanding APIs in this area to facilitate the development of language-aware plugins like code refactoring, but it also struck me that more precise and detailed scopes could allow for the implementation of useful editor functions, for example "goto reference" (the inverse of goto definition), in a lightweight manner. Thoughts?
Unfortunately not. It can only recognise LL(1) grammars, so the contents of the following lines can't influence the prior lines.[/quote]
Do you know if an LL(1) parser is sufficient for syntax-highlighting C++? I know that C++ can't be lexed with anything less than a Turing machine, but I don't know if syntax highlighting is as hard as lexing.
Great work so far! I absolutely love the new syntax file spec. I've been terrified of writing syntax files till now, but I think I will give it another go once this is done. There are a lot of issues with C++ syntax highlighting that I want to fix.
This is somewhat true, sadly. Once you understand the rules behind it, it becomes rather simple, and when assisted by a linter (such as sublimelinter-pyyaml) you can probably catch most of the bad unquoted strings easily before you run into issues. It's not as straightforward as JSON, however.
Anyway, I love YAML (which is why .YAML-tmLanguage exists), so I embrace this change.
By the way, SyntaxHighlightTools goes in a different direction and defines a custom YAML-like syntax, but it's not really suited for the context model imo.
It's nice that it bundles the tmPreferences stuff, however, which "works" but is still pretty clunky. It's afaik the only extension file type, besides tmTheme, that is still only available using ye old TextMate-compatible plists. Any plans for working on that in the future?
That sounds great! I would also like to know how much using atomic groups for the popular keyword matches would improve performance compared to normal grouped matches, because I never see them used anywhere, besides maybe in my own syntaxes.
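For anyone unfamiliar with the terminology, here is a minimal sketch of the two variants being compared; the keyword list and scope names are made up, and whether the atomic form is actually measurably faster is exactly the open question:

contexts:
  main:
    # ordinary non-capturing group
    - match: '\b(?:if|else|while|for)\b'
      scope: keyword.control.example
    # atomic group: once an alternative has matched, the engine
    # will not backtrack into the group again
    - match: '\b(?>if|else|while|for)\b'
      scope: keyword.control.example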
find_by_selector or extract_scope are really useful in that regard, but sometimes slightly better syntax crawling would be useful; something similar to what jQuery does, maybe, with parent, child or sibling queries. This should all be doable with the current APIs and some neat algorithms, however, except for the weird case that facelessuser described. If I ever needed something more complex I'd likely write some of these algorithms myself, but it would be interesting to know how taxing these two API calls are (for large files) and whether it makes sense to implement more performant variants in ST itself.
Now to my own feedback regarding the new syntax format, which is a bit more technical.
The possibility to match multi-line constructs with better context control is great.
The possibility to more or less "inject" matches into included contexts or other languages via the prototype stuff is also great.
Have you thought of including/pushing a specific context of a different syntax definition? Such as push: "Packages/HTML/HTML.sublime-syntax:tag", or using another character that is usually invalid in file paths. That would allow more flexible and organised syntax management and not require defining many hidden syntax definitions.
Would it be feasible to allow some kind of reflexive context awareness for these with_prototype-injected patterns? E.g. some way to only apply the pattern if "string" matches the entire current scope, or only the deepest scope name. Or maybe just make it apply only in certain contexts.
You have an example where a context match accesses the matched groups of the match that popped it, with a \1 backref (heredocs). I suppose this only works with the parent context's match, but this assumes that the parent context has a match in the first place, correct?
What about the normal Oniguruma syntax of \n (where n is any number to back-reference a previous group match)?
I know that this was also possible previously with begin-end-patterns but I never tested how it worked. Are matches in the end pattern simply appended matches of the begin pattern?
What will patterns backref if a context was changed using the set key?
Are named capture groups an option instead of always using numbers? It would surely be cleaner to have named capture groups and assign scope names to those groups by their name instead of by number sometimes. Oniguruma seems to support this according to the spec, but it's not enabled for syntax definitions and I don't know if the options are exclusive.
What happens when both YAML.tmLanguage and YAML.sublime-syntax are present?
The current YAML syntax def is still not exactly correct, but it's better than the previous one. I'll wait until you've settled the "bringing the syntax definitions up to date" part (which hopefully results in some form of open sourcing for contributions) before fixing it, however.
At this stage, no. YAML is a great format if you're familiar with it, but the syntax can be quite unintuitive at times (e.g., exactly when a string needs to be quoted is not always obvious). I'm not ruling it out, it's just something I'm approaching with caution.
This is somewhat true, sadly. Once you understand the rules behind it, it becomes rather simple, and when assisted by a linter (such as sublimelinter-pyyaml) you can probably catch most of the bad unquoted strings easily before you run into issues. It's not as straightforward as JSON, however.
Anyway, I love YAML (which is why .YAML-tmLanguage exists), so I embrace this change.
[/quote]
Now that I've spent more time working with YAML, I'm still not sure how I feel about it. I don't like the way it effectively forces 2-space indenting, and I don't like that there's just so much stuff in the spec, but once you know what's going on it is nice to write. It's easy enough to justify for things like syntax definitions, but not so much for everything else. Who would ever guess that double quotes and single quotes have different escaping rules, for example?
It's not mentioned in the documentation, but you can use contexts in other .sublime-syntax files in the same way as local ones (e.g., push them or include them). They can be referenced via "Packages/Foo/Foo.sublime-syntax", "Packages/Foo/Foo.sublime-syntax#main" (#main is implied if not present), or "Packages/Foo/Foo.sublime-syntax#strings".
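To illustrate, a minimal sketch of referencing external contexts; the Foo syntax, its strings context and the scope names are placeholders rather than anything that actually ships:

contexts:
  main:
    # include the main context of another syntax (#main is implied)
    - include: Packages/Foo/Foo.sublime-syntax
    # push a specific named context from another syntax
    - match: '"'
      scope: punctuation.definition.string.begin.example
      push: Packages/Foo/Foo.sublime-syntax#strings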
Currently the named context, and the transitive closure of the contexts it references, are cloned and rewritten with the patterns mentioned in with_prototype. I'm open to extending this so that some contexts won't be rewritten, but a use case would be nice.
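For reference, a minimal sketch of the kind of with_prototype injection being described; the embedded syntax path and scope name are placeholders:

contexts:
  main:
    - match: '<script>'
      scope: punctuation.definition.tag.begin.example
      push: Packages/Foo/Foo.sublime-syntax
      # this pattern is prepended to the pushed context and everything
      # it references, so the embedded syntax can always be exited
      with_prototype:
        - match: '(?=</script>)'
          pop: true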
[quote="FichteFoll"]You have an example where a context match accesses the matched groups of the match that popped it, with a \1 backref (heredocs). I suppose this only works with the parent context's match, but this assumes that the parent context has a match in the first place, correct?
What about the normal Oniguruma syntax of \n (where n is any number to back-reference a previous group match)?
I know that this was also possible previously with begin-end-patterns but I never tested how it worked. Are matches in the end pattern simply appended matches of the begin pattern?
What will patterns backref if a context was changed using the set key?
[/quote]
If the pop pattern (this processing is only done to the first one, btw) has any backrefs, then when the context is pushed onto the stack, the pop pattern will be rewritten with the values that the push or set match captured. If the regex used to push the context doesn't define a corresponding capture, then the empty string is substituted.
This also has the implication that pop regexes are unable to use backrefs in the normal way, as they'll be rewritten before the regex engine has a chance to see them. You can work around this with named captures though, which are ignored by the pop regex rewriting logic.
With regards to set vs push, they operate in exactly the same way, set just pops the current context and then pushes the given context(s) on the stack.
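To make the rewriting concrete, a minimal heredoc-style sketch consistent with the explanation above; the regexes and scope names are illustrative, not taken from a shipped definition:

contexts:
  main:
    # the push match captures the heredoc tag, e.g. EOF
    - match: '<<(\w+)'
      scope: keyword.operator.heredoc.example
      push: heredoc-body
  heredoc-body:
    - meta_scope: string.unquoted.heredoc.example
    # \1 is rewritten with the value captured by the push match, so after
    # pushing <<EOF this pop pattern effectively becomes ^EOF$
    - match: '^\1$'
      pop: true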
In principle they could be, I just havenāt done the work to enable them.
No special handling, they'll just appear in the menu as two separate syntax definitions. My current recommendation is to leave both, but mark the tmLanguage as hidden.
That doesnāt surprise me, YAML is not easy to get right!
Just a heads up, there may be some changes to the regex flavour used by sublime-syntax files in the next build, so you may not want to spend too much time playing with them in the current form.
I just discovered the conversion code and inspected it a bit. The reason I looked into it in the first place: it doesn't work for me.
Traceback (most recent call last):
  File "C:\Program Files\Sublime Text 3\sublime_plugin.py", line 535, in run_
    return self.run()
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 415, in run
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 344, in convert
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 205, in make_context
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 264, in make_context
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 127, in format_external_syntax
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 113, in syntax_for_scope
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 103, in build_scope_map
  File "./plistlib.py", line 104, in readPlistFromBytes
  File "./plistlib.py", line 76, in readPlist
  File "./plistlib.py", line 378, in parse
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 233, column 24
So, it appears that I have an invalid plist file in my packages somewhere. This needs to be caught and the conversion must continue. Furthermore, it would be nice if the name of the badly formatted file was displayed.
The format_regex function strips all leading whitespace from regular expressions and destroys indentation. I suggest using a function that detects the base indentation of a string (while skipping the first line, since that will most likely start with (?x)) and then strips only that indent instead of everything. Something like:
import textwrap

def format_regex(s):
    if "\n" in s:
        lines = s.splitlines(True)
        # dedent everything except the first line, which usually starts with (?x)
        s = lines[0] + textwrap.dedent(''.join(lines[1:]))
        s = s.rstrip("\n") + "\n"
    return s
You use plistlib for parsing. I've had problems with that before, notably that import xml.parsers.expat (which plistlib uses) failed on some Linux distros. This was all on ST2 though, but you might still want to look into it. I also had problems with plistlib sometimes reporting an "unknown encoding" for files with <?xml version="1.0" encoding="UTF-8"?>. I have no idea why that would have happened, since that's what plistlib.writePlist emits too.
ConvertSyntaxCommand is a WindowCommand, despite only really operating on a view. Is there a certain reason for this? If it's because TextCommands automatically come with an edit parameter (and create an undo point, though I'm not sure about that), then I propose adding a new "ViewCommand" that has the view object in self.view but does not automatically create an edit. The reason is that window.active_view() is not "portable" and doesn't work as expected if run with some_view.run_command("") and some_view is inactive.
import yaml should really be part of the if __name__ == "__main__": section
Edit: I discovered the $base include a while ago but never found out what it actually does. Considering it's translated to $top_level_main, I assume this is only relevant if the syntax def containing the include has itself been included from an external syntax?
By the way, I'm willing to work on an "official" Python.sublime-syntax file. I have a lot of experience with syntax definitions in general and have hacked on the PythonImproved syntax to a decent extent. Because of copyright issues I'd start from scratch, which would also result in a cleaner structure, I suppose.
Excellent work Jon, already much better than working with the tmLanguage files.
Would it be possible to ignore spaces in the regular expressions? This would help to break up regular expressions into groups to make reading and editing them easier. Currently they are just big regexes on one line. If I have missed a way to break them up visually please let me know.
Basically, it would assign to every scope expression (a dot-separated list of words) that precise expression as its scope.
Besides being a file that describes its own syntax (which is neat) and a nice example of a templating language, it would be useful when editing other syntax files, because you would be able to see how your current theme would colour the scope you are assigning, right in the syntax file itself.
By the way, LR parsers are more powerful while still using linear time. The current syntax is unable to parse the following:
(a,
b,
c) => 1
to identify the parameters (a, b, c) in JavaScript ES6, as regexes are restricted to single lines. LR is also what flex/bison use, and flex/bison is one of the most popular parser generators today.
@jps,
In sublimetext.com/docs/3/syntax.html, in the "header" section, where it says "file_line_match" it should be "first_line_match". Found out the hard way...
An update on this: I'm currently doing some work that I expect will give a significant speedup for syntax highlighting. This will primarily benefit loading large files and file indexing. I don't expect this to have any compatibility issues wrt regex flavour. Early results are promising, but there's more work to be done until it's ready. Next build won't be until next week at the earliest, and that may well turn out to be optimistic.
@jps do you anticipate ever superseding all the TextMate formats (tmTheme, tmPreferences, ...)? Not sure there's a business case for doing so, but it would be cool to have a "pure Sublime" environment.
Are there plans for a substitution syntax in regular expressions?
As I am crawling the YAML spec, I find myself needing certain regex patterns, like yaml.org/spec/1.2/spec.html#ns-uri-char, frequently. It's a pain to copy & paste these patterns all the time. It would be a lot easier if there were some syntax where I could write e.g. {{ns-uri-char}} inside a regular expression and have it substituted by the actual pattern, which I defined elsewhere. Other syntax suggestions welcome.
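A purely hypothetical sketch of what such a substitution syntax might look like; the variables section, the expansion rules and the (roughly approximated) ns-uri-char pattern are all invented for illustration:

variables:
  # rough approximation of ns-uri-char from the YAML 1.2 spec
  ns-uri-char: '%[0-9A-Fa-f]{2}|[0-9A-Za-z\-#;/?:@&=+$,_.!~*''()\[\]]'
contexts:
  main:
    - match: '(?:{{ns-uri-char}})+'
      scope: string.unquoted.uri.example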
[quote="simonzack"]By the way, LR parsers are more powerful while still using linear time. The current syntax is unable to parse the following:
(a,
b,
c) => 1
to identify the parameters (a, b, c) in JavaScript ES6, as regexes are restricted to single lines. LR is also what flex/bison use, and flex/bison is one of the most popular parser generators today.[/quote]
I created a syntax definition using the new file format which is able to detect argument lists across lines like this, for Fortran. It pushes a new context when it sees the equivalent of "(a," and only pops it when it sees the closing parenthesis. You could try something similar. My code is here:
That's the idea generally, but the reason this is an unusual case, for ES anyway, is that it's a (fortunately rare) bit of syntax where you need to look at a later token, which is permitted to fall on another line, in order to disambiguate the initial construct. Until you hit either a rest operator or the fat arrow, the syntax of the parameter group could as easily be a parenthesized expression. Since arrow functions are themselves expressions, and can appear in exactly the same places parenthesized expressions may appear, there's no outer context you can use to help.
The good news here is that in ES it's honestly pretty weird to have a line break before the arrow, and this is likewise true for the other small handful of potentially ambiguous cases, like binding patterns (for destructured assignment) vs literal objects and arrays, so you can use an Akira Tanaka Special to match braces and get it right 99.9% of the time. Here's what I used when I was attempting an ES6 sublime-syntax (since given up temporarily, because I kept running into problems with hanging Sublime that I couldn't successfully debug):
For binding patterns vs literals it's basically the same deal, except the lookahead wants to match {} or ] and see if it is followed by the vanilla assignment operator. In this case, admittedly, it's a little more likely to span lines, but I still think that's a weird enough case not to fret over. Plus, most of the time a destructured assignment will appear in a declaration, not an expression, in which case there's no need for a lookahead at all; for example the opening bracket in the following unambiguously begins an array binding pattern:
An update on where I'm at: I've built a new regex engine, replacing Oniguruma for the majority of the matching tasks within the syntax highlighter. Speed of the hybrid system is around twice that of the one in 3084, which will make large file loading and file indexing substantially more efficient. Aside from speed, there should be no user-visible changes, and the regex syntax is identical.
Plan is to tie off the loose ends and get a new build out next week, and then get back to updating the syntax definitions.
In hindsight, given the regex flavour did end up being identical, I would have put off this work until later, but I wasn't sure that was going to be the case until recently.