The latest work is done in the develop branch. I'll be pushing some helper dev tools for creating language test files shortly, although maybe any more work is moot at this point.
Btw, Jon, could it be possible to differentiate between .c and .h (or any body and header type), to then be able to assign a unique icon to each of them? It's not that important, but it would be neat to provide a visual difference :v
I don't know if that differentiation is a matter of syntax definition or tmPreferences, but right now I don't think it's possible with the current implementation…
With the current icon system, it'd require header files to have a separate syntax definition (which would presumably just include the regular one), which wouldn't be a bad thing in any case.
Interesting. The push and pop stuff reminds me of how Pygments does its syntax highlighting. I guess it's not an uncommon approach. I look forward to playing with it.
TextMate 2 has a nice feature that I'd like to see in Sublime Text, and that's being able to generate the scope name based on what you're matching. Here is a cool example from the Markdown syntax in TextMate:
heading = {
    name = 'markup.heading.${1/(#)(#)?(#)?(#)?(#)?(#)?/${6:?6:${5:?5:${4:?4:${3:?3:${2:?2:1}}}}}/}.markdown';
    begin = '(?:^|\G)(#{1,6})\s*(?=[\S[^#]])';
    end = '\s*(#{1,6})?$\n?';
    captures = { 1 = { name = 'punctuation.definition.heading.markdown'; }; };
    contentName = 'entity.name.section.markdown';
    patterns = ( { include = '#inline'; } );
};
It adds scope names like markup.heading.2.markdown and markup.heading.3.markdown depending on the heading level of the markdown. I've recreated these scopes by duplicating the match definition 6 times.
It would be awesome to have some basic string replacement capabilities based on the match regex in the scope name.
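For illustration, that duplication looks roughly like this in the new format (a simplified sketch with levels 3-6 omitted, not the actual Markdown definition):

contexts:
  headings:
    # one rule per heading level, because the scope name cannot be
    # derived from how many '#' characters were matched
    - match: '^(#)(?!#)\s*(.*)$'
      scope: markup.heading.1.markdown
      captures:
        1: punctuation.definition.heading.markdown
    - match: '^(##)(?!#)\s*(.*)$'
      scope: markup.heading.2.markdown
      captures:
        1: punctuation.definition.heading.markdown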
Yes, btw: there will be an automatic converter from .tmLanguage to .sublime-syntax files.[/quote]
Excellent. Since the new system looks much nicer to work with, I hope this will stimulate community involvement in creating new language definitions and fixing/updating existing ones.
To return to an earlier point, I'm wondering just how much room there is for improvement in Sublime's syntax parser and what benefits such improvements could bring. I referred previously to expanding APIs in this area to facilitate the development of language-aware plugins like code refactoring, but it also struck me that more precise and detailed scopes could allow for the implementation of useful editor functions, for example "goto reference" (the inverse of goto definition), in a lightweight manner. Thoughts?
Unfortunately not. It can only recognise LL(1) grammars, so the contents of the following lines can't influence the prior lines.[/quote]
Do you know if an LL(1) parser is sufficient for syntax-highlighting C++? I know that C++ can't be lexed with anything less than a Turing machine, but I don't know if syntax highlighting is as hard as lexing.
Great work so far! I absolutely love the new syntax file spec. I've been terrified of writing syntax files till now, but I think I will give it another go once this is done. There are a lot of issues with C++ syntax highlighting that I want to fix.
This is somewhat true, sadly. Once you understand the rules behind it, it becomes rather simple, and when assisted by a linter (such as sublimelinter-pyyaml) you can probably catch most of the bad unquoted strings easily before you run into issues. It's not as straightforward as JSON, however.
Anyway, I love YAML (which is why .YAML-tmLanguage exists), so I embrace this change.
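To illustrate the kind of bad unquoted strings such a linter catches, a small hypothetical example (made-up rules):

contexts:
  main:
    # fine unquoted: this is a plain YAML scalar
    - match: \bimport\b
      scope: keyword.control.import.example
    # must be quoted: an unquoted value starting with '[' would be parsed
    # as a YAML flow sequence, not as a regex character class
    - match: '[A-Za-z_]\w*'
      scope: variable.other.example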
By the way, SyntaxHighlightTools goes in a different direction and defines a custom YAML-like syntax, but it's not really suited for the context model imo.
It's nice that it bundles the tmPreferences stuff, however, which "works" but is still pretty clunky. It's afaik the only extension file type (besides tmTheme) that is still only available using ye old TextMate-compatible plists. Any plans for working on that in the future?
That sounds great! I would also like to know how much using atomic groups for the popular keyword matches would improve performance compared to normal grouped matches, because I never see them used anywhere, besides maybe in my own syntaxes.
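For reference, this is the difference in question, sketched with a made-up keyword list: (?:...) is an ordinary non-capturing group, while (?>...) is Oniguruma's atomic group, which refuses to backtrack into the alternation once it has matched.

contexts:
  keywords:
    # ordinary non-capturing group
    - match: '\b(?:if|else|while|for)\b'
      scope: keyword.control.example
    # the same alternation as an atomic group; once one alternative has
    # matched, the engine will not backtrack into the group again
    - match: '\b(?>if|else|while|for)\b'
      scope: keyword.control.example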
find_by_selector or extract_scope are really useful in that regard, but sometimes slightly better scope crawling would help; something similar to what jQuery does with parent, child or sibling queries, maybe. This should all be doable with the current APIs and some neat algorithms, however, except for the weird case that facelessuser described. If I ever needed something more complex I'd likely write some of these algorithms myself, but it would be interesting to know how taxing these two API calls are (for large files) and whether it makes sense to implement more performant variants in ST itself.
Now to my own feedback regarding the new syntax format, which is a bit more technical.
The possibility to match multi-line constructs with more direct context control is great.
The possibility to more or less "inject" matches into included contexts or other languages via the prototype stuff is also great.
Have you thought of including/pushing a specific context of a different syntax definition? Such as push: "Packages/HTML/HTML.sublime-syntax:tag", or using another character that is usually invalid in file paths. That would allow more flexible and organised syntax management and wouldn't require defining many hidden syntax definitions.
Would it be feasible to allow some kind of reflexive context-awareness for these with_prototype-injected patterns? E.g. some way to only apply the pattern if "string" matches the entire current scope or only the deepest scope name. Or maybe just make it apply only in certain contexts.
You have an example where a context's match accesses the matched groups of the match that pushed it, with a \1 backref (heredocs). I suppose this only works with the parent context's match, but this assumes that the parent context has a match in the first place, correct?
What about the normal Oniguruma syntax of \n (where n is any number to back-reference a previous group match)?
I know that this was also possible previously with begin/end patterns, but I never tested how it worked. Are matches in the end pattern simply appended to those of the begin pattern?
What will patterns backref if a context was changed using the set key?
Are named capture groups an option instead of always using numbers? It would surely be cleaner to have named capture groups and assign scope names to those groups by their name instead of by number sometimes. Oniguruma seems to support this according to the spec, but it's not enabled for syntax definitions and I don't know if the options are exclusive.
What happens when both YAML.tmLanguage and YAML.sublime-syntax are present?
The current YAML syntax def is still not exactly correct, but it's better than the previous one. I'll wait with fixing it until you've settled the "bringing the syntax definitions up to date" part (which hopefully results in some form of open sourcing for contributions), however.
At this stage, no. YAML is a great format if you're familiar with it, but the syntax can be quite unintuitive at times (e.g., exactly when a string needs to be quoted is not always obvious). I'm not ruling it out, it's just something I'm approaching with caution.
This is somewhat true, sadly. Once you understand the rules behind it, it becomes rather simple, and when assisted by a linter (such as sublimelinter-pyyaml) you can probably catch most of the bad unquoted strings easily before you run into issues. It's not as straightforward as JSON, however.
Anyway, I love YAML (which is why .YAML-tmLanguage exists), so I embrace this change.
[/quote]
Now that I've spent more time working with YAML, I'm still not sure how I feel about it. I don't like the way it effectively forces 2-space indenting, and I don't like that there's just so much stuff in the spec, but once you know what's going on it is nice to write. It's easy enough to justify for things like syntax definitions, but not so much for everything else. Who would ever guess that double quotes and single quotes have different escaping rules, for example?
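For anyone wondering, the difference in a nutshell (plain YAML, nothing Sublime-specific):

# single quotes: the only escape is '' for a literal quote; \n stays two characters
single: 'it''s literal: \n'
# double quotes: backslash escapes are interpreted; \n becomes a real newline
double: "it's escaped: \n and \" works"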
It's not mentioned in the documentation, but you can use contexts in other .sublime-syntax files in the same way as local ones (e.g., push them or include them). They can be referenced via "Packages/Foo/Foo.sublime-syntax", "Packages/Foo/Foo.sublime-syntax#main" (#main is implied if not present), or "Packages/Foo/Foo.sublime-syntax#strings".
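A quick sketch of what that looks like in practice, using the hypothetical Packages/Foo/Foo.sublime-syntax from above:

contexts:
  main:
    # include the other syntax's main context (#main is implied)
    - include: Packages/Foo/Foo.sublime-syntax
    # push a specific named context from that file
    - match: '"'
      scope: punctuation.definition.string.begin.example
      push: Packages/Foo/Foo.sublime-syntax#strings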
Currently the named context, and the transitive closure of the contexts it references, are cloned and rewritten with the patterns mentioned in with_prototype. I'm open to extending this so that some contexts won't be rewritten, but a use case would be nice.
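For example, a hypothetical heredoc-style embedding along the lines discussed above, where the injected pop pattern ends up in every cloned context:

contexts:
  main:
    # embed the HTML syntax inside a heredoc-like construct; with_prototype
    # injects the pop pattern into every context cloned from
    # HTML.sublime-syntax, so the embedding can be terminated
    - match: '<<HTML'
      scope: keyword.operator.heredoc.example
      push: Packages/HTML/HTML.sublime-syntax
      with_prototype:
        - match: '^HTML$'
          pop: true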
[quote="FichteFoll"]You have an example where a context's match accesses the matched groups of the match that pushed it, with a \1 backref (heredocs). I suppose this only works with the parent context's match, but this assumes that the parent context has a match in the first place, correct?
What about the normal Oniguruma syntax of \n (where n is any number to back-reference a previous group match)?
I know that this was also possible previously with begin/end patterns, but I never tested how it worked. Are matches in the end pattern simply appended to those of the begin pattern?
What will patterns backref if a context was changed using the set key?
[/quote]
If the pop pattern (this processing is only done to the first one, btw) has any backrefs, then when the context is pushed onto the stack, the pop pattern is rewritten with the values that the push or set match captured. If the regex used to push the context doesn't define a corresponding capture, then the empty string is substituted.
This also has the implication that pop regexes are unable to use backrefs in the normal way, as they'll be rewritten before the regex engine has a chance to see them. You can work around this with named captures though, which are ignored by the pop regex rewriting logic.
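To make the rewriting concrete, a minimal heredoc-style sketch (made-up scope names):

contexts:
  main:
    # the identifier captured by this push match is what \1 below
    # gets rewritten to when heredoc-body is pushed
    - match: '<<(\w+)'
      scope: keyword.operator.heredoc.example
      push: heredoc-body
  heredoc-body:
    - meta_scope: string.unquoted.heredoc.example
    # \1 is substituted before the regex engine sees this pattern;
    # named groups in a pop pattern are left alone by the rewriting
    - match: '^\1$'
      pop: true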
With regards to set vs push, they operate in exactly the same way, set just pops the current context and then pushes the given context(s) on the stack.
In principle they could be, I just haven't done the work to enable them.
No special handling, they'll just appear in the menu as two separate syntax definitions. My current recommendation is to leave both, but mark the tmLanguage as hidden.
That doesn't surprise me, YAML is not easy to get right!
Just a heads up, there may be some changes to the regex flavour used by sublime-syntax files in the next build, so you may not want to spend too much time playing with them in the current form.
I just discovered the conversion code and inspected it a bit. The reason I looked into it in the first place: it doesn't work for me.
Traceback (most recent call last):
File "C:\Program Files\Sublime Text 3\sublime_plugin.py", line 535, in run_
return self.run()
File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 415, in run
File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 344, in convert
File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 205, in make_context
File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 264, in make_context
File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 127, in format_external_syntax
File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 113, in syntax_for_scope
File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 103, in build_scope_map
File "./plistlib.py", line 104, in readPlistFromBytes
raise AttributeError(attr)
File "./plistlib.py", line 76, in readPlist
#
File "./plistlib.py", line 378, in parse
raise ValueError("unexpected key at line %d" %
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 233, column 24
So, it appears that I have an invalid plist file in my packages somewhere. This needs to be caught and the conversion must continue. Furthermore, it would be nice if the name of the badly formatted file was displayed.
The format_regex function strips all leading whitespace from regular expressions and destroys indentation. I suggest using a function that detects the base indentation of the string (skipping the first line, since that will most likely start with (?x)) and then strips only that indent instead of everything. Something like
import textwrap

def format_regex(s):
    if "\n" in s:
        lines = s.splitlines(True)
        # dedent everything except the first line, which usually holds "(?x)"
        s = lines[0] + textwrap.dedent(''.join(lines[1:]))
        s = s.rstrip("\n") + "\n"
    return s
You use plistlib for parsing. I've had problems with that before, notably that import xml.parsers.expat (which plistlib uses) failed on some Linux distros. This was all on ST2 though, but you might still want to look into it. I also had problems with plistlib sometimes reporting an "unknown encoding" for files with <?xml version="1.0" encoding="UTF-8"?>. I have no idea why that would happen, since that's what plistlib.writePlist emits too.
ConvertSyntaxCommand is a WindowCommand, despite only really operating on a view. Is there a particular reason for this? If it's because TextCommands automatically come with an edit parameter (and create an undo point, though I'm not sure about that), then I propose adding a new "ViewCommand" that has the view object in self.view but does not automatically create an edit. The reason is that window.active_view() is not "portable" and doesn't work as expected if run with some_view.run_command("") while some_view is inactive.
import yaml should really be part of the if __name__ == "__main__": section
Edit: I discovered the $base include a while ago but never found out what it actually does. Considering it's translated to $top_level_main, I assume this is only relevant if the syntax definition containing the include has itself been included from another syntax?
By the way, I'm willing to work on an "official" Python.sublime-syntax file. I have a lot of experience with syntax definitions in general and have hacked on the PythonImproved syntax to a decent extent. Because of copyright issues I'd start from scratch, which would also result in a cleaner structure, I suppose.
Excellent work Jon, already much better than working with the tmLanguage files.
Would it be possible to ignore spaces in the regular expressions? This would help to break up regular expressions into groups to make reading and editing them easier. Currently they are just big regexes on one line. If I have missed a way to break them up visually, please let me know.
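One thing that should already help, assuming Oniguruma's extended mode is usable in match patterns (the (?x) mentioned earlier in connection with format_regex): combined with a YAML block scalar, whitespace and # comments inside the pattern are ignored, so it can be split across lines. A sketch with a made-up rule:

contexts:
  main:
    - match: |-
        (?x)                  # extended mode: ignore whitespace and comments
        \b(def)\s+            # keyword
        ([A-Za-z_]\w*)        # function name
        \s*(\()               # opening parenthesis
      captures:
        1: keyword.declaration.function.example
        2: entity.name.function.example
        3: punctuation.section.parameters.begin.example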