Sublime Forum

Syntax Fun

#31

Interesting. The push and pop stuff reminds me of how Pygments does its syntax highlighting. I guess it's not an uncommon approach. Looking forward to playing with it.

0 Likes

#32

Yes, btw: there will be an automatic converter from .tmLanguage to .sublime-syntax files.

0 Likes

#33

TextMate 2 has a nice feature that I'd like to see in Sublime Text: being able to generate the scope name based on what you're matching. Here is a cool example from the Markdown syntax in TextMate:

heading = {
  name = 'markup.heading.${1/(#)(#)?(#)?(#)?(#)?(#)?/${6:?6:${5:?5:${4:?4:${3:?3:${2:?2:1}}}}}/}.markdown';
  begin = '(?:^|\G)(#{1,6})\s*(?=[\S&&[^#]])';
  end = '\s*(#{1,6})?$\n?';
  captures = { 1 = { name = 'punctuation.definition.heading.markdown'; }; };
  contentName = 'entity.name.section.markdown';
  patterns = ( { include = '#inline'; } );
}

It adds scope names like markup.heading.2.markdown and markup.heading.3.markdown depending on the heading level. I've recreated these scopes by duplicating the match definition 6 times.
It would be awesome to have some basic string replacement capabilities in scope names, based on the match regex.
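As a rough illustration of what the TextMate format string above computes, here is a minimal Python sketch that derives the scope name from the number of leading # characters. The name heading_scope is hypothetical and the regex is a simplified stand-in for the begin pattern; this is not how TextMate or Sublime Text implement it.

```python
import re

# Simplified stand-in for the grammar's begin pattern:
# 1-6 hashes, optional whitespace, then a character that is neither space nor '#'.
HEADING = re.compile(r"^(#{1,6})\s*(?=[^\s#])")


def heading_scope(line):
    """Return e.g. 'markup.heading.2.markdown' for '## Title', else None."""
    m = HEADING.match(line)
    if m is None:
        return None
    level = len(m.group(1))  # heading level = number of '#' captured
    return "markup.heading.%d.markdown" % level
```

With substitution support in scope names, a single match rule could cover all six heading levels instead of six duplicated definitions.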

1 Like

#34

[quote="jps"]

Yes, btw: there will be an automatic converter from .tmLanguage to .sublime-syntax files.[/quote]

Excellent. Since the new system looks much nicer to work with, I hope this will stimulate community involvement in the creation of new and bugfix/update of existing language definitions.

To return to an earlier point, I'm wondering just how much room there is for improvement in Sublime's syntax parser and what benefits such improvements could bring. I referred previously to expanding APIs in this area to facilitate the development of language-aware plugins like code refactoring, but it also struck me that more precise and detailed scopes could allow for the implementation of useful editor functions, for example "goto reference" (inverse of goto definition), in a lightweight manner. Thoughts?

0 Likes

#35

[quote="jps"]

Unfortunately not. It can only recognise LL(1) grammars, so the contents of the following lines can't influence the prior lines.[/quote]

Do you know if an LL(1) parser is sufficient for syntax-highlighting C++? I know that C++ can't be lexed with anything less than a Turing machine, but I don't know if syntax highlighting is as hard as lexing.

Great work so far! I absolutely love the new syntax file spec. I've been terrified of writing syntax files till now, but I think I will give it another go once this is done. There are a lot of issues with C++ syntax highlighting that I want to fix. :smile:

0 Likes

#36

First off, awesome!
Secondly, but muh time. :frowning:

This is somewhat true, sadly. Once you understand the rules behind it, it becomes rather simple, and when assisted by a linter (such as sublimelinter-pyyaml) you can probably catch most of the bad unquoted strings easily before you run into issues. It's not as straightforward as JSON however.
Anyway, I love YAML (which is why .YAML-tmLanguage exists), so I embrace this change.

By the way, SyntaxHighlightTools goes in a different direction and defines a custom YAML-like syntax, but it's not really suited for the context model imo.
It's nice that it bundles the tmPreferences stuff however, which "works" but is still pretty clunky. It's afaik the only extension file type that is still only available using ye olde TextMate-compatible plists, besides tmTheme. Any plans for working on that in the future?

That sounds great! I would also like to know how much the usage of atomic groups for the popular keyword matches would improve performance compared to normal grouped matches because I never see them used anyway, besides maybe my own syntaxes.

find_by_selector or extract_scope are really useful in that regard, but sometimes slightly better syntax crawling could be useful. Something similar to what jQuery does, maybe, with the parent, child or sibling queries. This should all be doable with the current APIs and some neat algorithms however, except for the weird case that facelessuser described. If I ever needed something more complex I'd likely write some of these algorithms myself, but it would be interesting to know how taxing these two API calls are (for large files) and whether it makes sense to implement more performant variants in ST itself.

Now to my own feedback regarding the new syntax format, which is a bit on the technical side.

  1. The possibility to match multi-line stuff with barer context control is great.

  2. The possibility to more or less "inject" matches into included contexts or other languages with the prototype stuff is also great.
    Have you thought of including/pushing a specific context of a different syntax definition? Such as push: "Packages/HTML/HTML.sublime-syntax:tag" or using another character that is usually invalid in file paths. That would allow more flexible and organised syntax management and not require defining multiple hidden syntax definitions.

  3. Would it be feasible to allow some kind of reflexive context-awareness for these with_prototype-injected patterns? E.g. some way to only apply the pattern if "string" matches the entire current scope or only the deepest scope name. Or maybe just make it apply only in certain contexts.

  4. You have an example where a context match accesses the matched groups of the match that popped it with a \1 backref (heredocs). I suppose this only works with the parent context's match, but this assumes that the parent context has a match in the first place, correct?

What about the normal Oniguruma syntax of \n (where n is any number to back-reference a previous group match)?
I know that this was also possible previously with begin-end-patterns but I never tested how it worked. Are matches in the end pattern simply appended matches of the begin pattern?

What will patterns backref if a context was changed using the set key?

  5. Are named capture groups an option instead of always using numbers? It would surely be cleaner sometimes to have named capture groups and assign scope names to those groups by their name instead of the number. Oniguruma seems to support this according to the spec, but it's not enabled for syntax definitions and I don't know if the options are exclusive.

  6. What happens when both YAML.tmLanguage and YAML.sublime-syntax are present?

  7. The current YAML syntax def is still not exactly correct, but it's better than the previous one. I'll wait until you have settled the "bringing the syntax definitions up-to-date" work (which hopefully results in some form of open sourcing for contributions) before fixing it, however.

0 Likes

#37

[quote="FichteFoll"]First off, awesome!

At this stage, no. YAML is a great format if you're familiar with it, but the syntax can be quite unintuitive at times (e.g., exactly when a string needs to be quoted is not always obvious). I'm not ruling it out, it's just something I'm approaching with caution.

This is somewhat true, sadly. Once you understand the rules behind it, it becomes rather simple, and when assisted by a linter (such as sublimelinter-pyyaml) you can probably catch most of the bad unquoted strings easily before you run into issues. It's not as straightforward as JSON however.
Anyway, I love YAML (which is why .YAML-tmLanguage exists), so I embrace this change.
[/quote]

Now that I've spent more time working with YAML, I'm still not sure how I feel about it. I don't like the way it effectively forces 2-space indenting, and I don't like that there's just so much stuff in the spec, but once you know what's going on it is nice to write. It's easy enough to justify for things like syntax definitions, but not so much for everything else. Who would ever guess that double quotes and single quotes have different escaping rules, for example?

It's not mentioned in the documentation, but you can use contexts in other .sublime-syntax files in the same way as local ones (e.g., push them or include them). They can be referenced via "Packages/Foo/Foo.sublime-syntax", "Packages/Foo/Foo.sublime-syntax#main" (#main is implied if not present), or "Packages/Foo/Foo.sublime-syntax#strings".

Currently the named context and the transitive closure of the contexts it references are cloned and rewritten with the patterns mentioned in with_prototype. I'm open to extending this so that some contexts won't be rewritten, but a use case would be nice :slight_smile:

[quote="FichteFoll"]You have an example where a context match accesses the matched groups of the match that popped it with a \1 backref (heredocs). I suppose this only works with the parent context's match, but this assumes that the parent context has a match in the first place, correct?

What about the normal Oniguruma syntax of \n (where n is any number to back-reference a previous group match)?
I know that this was also possible previously with begin-end-patterns but I never tested how it worked. Are matches in the end pattern simply appended matches of the begin pattern?

What will patterns backref if a context was changed using the set key?
[/quote]

If the pop pattern (this processing is only done to the first, btw) has any backrefs, then when the context is pushed onto the stack, the pop pattern will be rewritten with the values that the push or set captured. If the regex used to push the context doesn't define a corresponding capture, then the empty string is substituted.

This also has the implication that pop regexes are unable to use backrefs in the normal way, as they'll be rewritten before the regex engine has a chance to see them. You can work around this with named captures though, which are ignored by the pop regex rewriting logic.

With regards to set vs push, they operate in exactly the same way, set just pops the current context and then pushes the given context(s) on the stack.
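The rewriting described above can be sketched in Python roughly as follows. This is only an illustration of the idea, not Sublime Text's implementation: the function name rewrite_pop_pattern is hypothetical, Python's re stands in for Oniguruma, and escaping the captured text so that it matches literally is an assumption.

```python
import re


def rewrite_pop_pattern(pop_pattern, push_match):
    """Replace \\1..\\9 in pop_pattern with the text captured by push_match."""

    def substitute(m):
        ch = m.group(1)
        if ch not in "123456789":
            # Some other escape (\b, \\, \k<name>, ...): leave it untouched,
            # which is why named backrefs survive the rewrite.
            return m.group(0)
        index = int(ch)
        if index > push_match.re.groups:
            return ""  # no corresponding capture: substitute the empty string
        # Assumption: captured text is escaped so it is matched literally.
        return re.escape(push_match.group(index) or "")

    # Consume the pattern one escape pair at a time so an escaped backslash
    # followed by a digit is not misread as a backreference.
    return re.sub(r"\\(.)", substitute, pop_pattern, flags=re.S)
```

For a heredoc, a push regex such as `<<(\w+)` matched against `<<EOF` would turn the pop pattern `^\1$` into `^EOF$` before the regex engine ever sees it.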

In principle they could be, I just haven't done the work to enable them.

No special handling, they'll just appear in the menu as two separate syntax definitions. My current recommendation is to leave both, but mark the tmLanguage as hidden.

That doesn't surprise me, YAML is not easy to get right!

1 Like

#38

Just a heads up, there may be some changes to the regex flavour used by sublime-syntax files in the next build, so you may not want to spend too much time playing with them in the current form.

0 Likes

#39

I just discovered the conversion code and inspected it a bit. (Initially because it didn't work for me.)

  1. The reason why I looked into it in the first place: it doesn't work for me.

Traceback (most recent call last):
  File "C:\Program Files\Sublime Text 3\sublime_plugin.py", line 535, in run_
    return self.run()
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 415, in run
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 344, in convert
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 205, in make_context
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 264, in make_context
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 127, in format_external_syntax
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 113, in syntax_for_scope
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 103, in build_scope_map
  File "./plistlib.py", line 104, in readPlistFromBytes
    raise AttributeError(attr)
  File "./plistlib.py", line 76, in readPlist
    #
  File "./plistlib.py", line 378, in parse
    raise ValueError("unexpected key at line %d" %
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 233, column 24

So, it appears that I have an invalid plist file in my packages somewhere. This needs to be caught and the conversion must continue. Furthermore, it would be nice if the name of the badly formatted file was displayed.
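A minimal sketch of the suggested behaviour, assuming a current Python where plistlib.loads is available (the converter in the traceback uses the older readPlistFromBytes). The build_scope_map here is a hypothetical stand-in that takes a mapping of file name to raw bytes; it is not the code shipped in Default.sublime-package.

```python
import plistlib
from xml.parsers.expat import ExpatError


def build_scope_map(plists):
    """Map scopeName -> file name, skipping plist files that fail to parse.

    plists: mapping of file name -> raw bytes of a .tmLanguage file.
    Returns (scope_map, skipped), where skipped lists (file name, error text).
    """
    scope_map = {}
    skipped = []
    for name, data in plists.items():
        try:
            syntax = plistlib.loads(data)
        except (ExpatError, plistlib.InvalidFileException, ValueError) as exc:
            # Remember which file was broken, then carry on with the rest.
            skipped.append((name, str(exc)))
            continue
        if isinstance(syntax, dict) and syntax.get("scopeName"):
            scope_map[syntax["scopeName"]] = name
    return scope_map, skipped
```

Returning the skipped files alongside the map lets the caller report the offending file name instead of aborting the whole conversion.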

The format_regex function strips all leading whitespace from regular expressions and destroys their indentation. I suggest using a function that detects the base indentation of a string (while skipping the first line, since that will most likely start with (?x)) and then strips only that indent instead of everything. Something like

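import textwrap

def format_regex(s):
    # Only multi-line regexes need re-indenting.
    if "\n" in s:
        lines = s.splitlines(True)
        # Keep the first line (usually "(?x)") as-is and strip only the
        # common indentation from the remaining lines.
        s = lines[0] + textwrap.dedent(''.join(lines[1:]))
        s = s.rstrip("\n") + "\n"
    return s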

You use plistlib for parsing. I've had problems with that before, notably that import xml.parsers.expat, which plistlib uses, failed on some Linux distros. This was all on ST2 though, but you might still want to look into it. I also had problems with plistlib sometimes reporting an "unknown encoding" for files with <?xml version="1.0" encoding="UTF-8"?>. I have no idea why that would have happened since that's what it emits with plistlib.writePlist too.

ConvertSyntaxCommand is a WindowCommand, despite only really operating on a view. Is there a certain reason for this? If it's because TextCommands automatically come with an edit parameter (and create an undo point, though I'm not sure about that), then I propose adding a new "ViewCommand" that has the view object in self.view but does not automatically create an edit. The reason is that window.active_view() is not "portable" and doesn't work as expected if run with some_view.run_command("") and some_view is inactive.

  1. import yaml should really be part of the if __name__ == "__main__": section

  2. Edit: I discovered the $base include a while ago but never found out what it actually does. Considering it's translated to $top_level_main, I assume that this is only relevant if the syntax def this include is in has been included from externally?

By the way, I'm willing to work on an "official" Python.sublime-syntax file. I have a lot of experience with syntax definitions in general and hacked on the PythonImproved syntax to a decent extent. Because of copyright issues I'd start from scratch, which would also result in a clean structure I suppose.

0 Likes

#40

Excellent work Jon, already much better than working with the tmLanguage files.

Would it be possible to ignore spaces in the regular expressions? This would help to break up regular expressions into groups to make reading and editing them easier. Currently they are just big regexes on one line. If I have missed a way to break them up visually please let me know.

0 Likes

#41

@315234 (awesome name): Use the (?x) switch. It will ignore all spaces after it, except for those within character sets. geocities.jp/kosako3/oniguruma/doc/RE.txt
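To show the effect of the (?x) switch, here is a small sketch using Python's re module, which has an equivalent extended mode. Sublime Text uses Oniguruma rather than Python's engine, so treat this only as an illustration of the idea; the keyword pattern itself is made up for the example.

```python
import re

# The same keyword pattern written compactly and in extended (?x) form.
compact = re.compile(r"\b(?:if|else|while)\b")
extended = re.compile(r"""(?x)
    \b
    (?: if | else | while )   # alternatives, spaced out for readability
    \b
""")

# Inside a character class the space is still significant.
spaced = re.compile(r"(?x) a [ ] b")
```

Both compiled patterns accept the same keywords, while the last one only matches when a literal space sits between "a" and "b".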

0 Likes

#42

Allowing a captured subexpression in the scope would allow creating the following syntax for sublime-syntax files.

contexts:
  main:
    - match: ""
      push: Packages/YAML/YAML.sublime-syntax
      with_prototype:
        - match: ([a-z]+(?:\.[a-z]+)+)
          scope: \1

Basically it would assign to every scope expression (dot separated list of words) that precise expression as its scope.

Besides being a file that describes its own syntax (which is neat) and a nice example of a templating language; it would be useful when editing other syntax files because you would be able to see how your current theme would color the scope that you are assigning on the syntax file itself.

0 Likes

#43

By the way, LR parsers are more powerful while still using linear time. The current syntax is unable to parse the following:

(a,
b,
c) => 1

to identify parameters (a, b, c) in JavaScript ES6, as regexes are restricted to single lines. This is also what flex/bison uses, and flex/bison is one of the most popular parser generators today.

0 Likes

#44

@jps,
In sublimetext.com/docs/3/syntax.html, in the "header" section, where it says "file_line_match" it should be "first_line_match". Found out the hard way... :laughing:

0 Likes

#45

An update on this: I'm currently doing some work that I expect will give a significant speedup for syntax highlighting. This will primarily benefit loading large files and file indexing. I don't expect this to have any compatibility issues wrt regex flavour. Early results are promising, but there's more work to be done until it's ready. Next build won't be until next week at the earliest, and that may well turn out to be optimistic.

@FichteFoll thanks for looking into it, will fix

@farsouth, thanks, will fix

0 Likes

#46

@jps do you anticipate ever superseding all the TextMate formats (tmTheme, tmPreferences, ...)? Not sure there's a business case for doing so, but it would be cool to have a "pure Sublime" environment :stuck_out_tongue:

0 Likes

#47

Are there plans for a substitution syntax in regular expressions?
As I am crawling the YAML spec I find myself frequently needing certain regex patterns like yaml.org/spec/1.2/spec.html#ns-uri-char. It's a pain to copy and paste these patterns all the time. It would be a lot easier if there was some syntax where I could write e.g. {{ns-uri-char}} inside a regular expression and have it substituted by the actual pattern which I defined elsewhere. Other syntax suggestions welcome.
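The proposed {{name}} substitution could work roughly like the following Python sketch. The function expand_variables and the example variable names are hypothetical, and the sketch does no cycle detection; it only shows the kind of preprocessing being asked for.

```python
import re


def expand_variables(pattern, variables):
    """Replace each {{name}} in pattern with variables[name], recursively."""

    def substitute(m):
        name = m.group(1)
        if name not in variables:
            raise KeyError("undefined regex variable: %s" % name)
        # Expand nested references, then wrap the result in a non-capturing
        # group so quantifiers after {{name}} apply to the whole fragment.
        return "(?:%s)" % expand_variables(variables[name], variables)

    return re.sub(r"\{\{([A-Za-z0-9_-]+)\}\}", substitute, pattern)
```

A fragment such as a hex digit can then be defined once and reused by name in other fragments and in match patterns.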

Hopefully you add named capture groups. :slight_smile:

0 Likes

#48

[quote="simonzack"]By the way, LR parsers are more powerful while still using linear time. The current syntax is unable to parse the following:

(a,
b,
c) => 1

to identify parameters (a, b, c) in JavaScript ES6, as regexes are restricted to single lines. This is also what flex/bison uses, and flex/bison is one of the most popular parser generators today.[/quote]

I created a syntax definition using the new file format which is able to detect argument lists across lines like this, for Fortran. It pushes a new context when it sees the equivalent of "(a," and only pops it when it sees the closing parenthesis. You could try something similar. My code is here:
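The push-on-open, pop-on-close idea can be sketched as a tiny line-by-line scanner in Python. This is a simplified illustration and not taken from the SublimeFortran package; the function name, the context names and the scope strings are all hypothetical.

```python
import re

# Identifiers and parentheses are the only tokens this sketch cares about.
TOKEN = re.compile(r"[A-Za-z_]\w*|[()]")


def scope_lines(lines):
    """Scan line by line, keeping a context stack that survives line breaks."""
    stack = ["main"]
    out = []
    for line in lines:
        for m in TOKEN.finditer(line):
            tok = m.group()
            if tok == "(":
                stack.append("params")      # push a context on '('
            elif tok == ")":
                if len(stack) > 1:
                    stack.pop()             # pop it again on ')'
            elif stack[-1] == "params":
                out.append((tok, "variable.parameter"))
            else:
                out.append((tok, "source"))
    return out
```

Because the stack lives outside the per-line loop, identifiers on the second and third lines are still scoped as parameters even though each regex match only ever sees one line.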

github.com/315234/SublimeFortran

0 Likes

#49

That's the idea generally, but the reason this is an unusual case, for ES anyway, is that it's a (fortunately rare) bit of syntax where you need to look at a later token, which is permitted to fall on another line, in order to disambiguate the initial construct. Until you hit either a rest operator or the fat arrow, the syntax of the parameter group could as easily be a parenthesized expression. Since arrow functions are themselves expressions, and can appear in exactly the same places parenthesized expressions may appear, there's no outer context you can use to help.

The good news here is that in ES it's honestly pretty weird to have a line break before the arrow, and this is likewise true for the other small handful of potentially ambiguous cases, like binding patterns (for destructured assignment) vs literal objects and arrays, so you can use an Akira Tanaka Special to match braces and get it right 99.9% of the time. Here's what I used when I was attempting an ES6 sublime-syntax (since given up temporarily, because I kept running into problems with hanging Sublime that I couldn't successfully debug):

  arrowFunctionWithParens:
    - match: '(?x) \( (?= (?<parens> [^\(\)] | \( \g<parens>* \) )* \)\s*=> )'
      scope: punctuation.definition.parameters.begin
      set: [ arrowFunction_AFTER_PARAMS, params ]
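What that lookahead checks can be spelled out procedurally. Python's standard re module has no recursive subpatterns like \g<parens>, so this sketch uses an explicit depth counter instead; the function name is hypothetical, and like the regex it only looks at a single line and ignores strings and comments.

```python
import re


def looks_like_arrow_params(text, start):
    """True if text[start] is '(' opening a balanced group followed by '=>'."""
    if start >= len(text) or text[start] != "(":
        return False
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                # Balanced group closed: allow optional spaces, then '=>'.
                return re.match(r"\s*=>", text[i + 1:]) is not None
    return False  # parentheses not balanced within this line
```

A parenthesized expression such as `(a + b) * 2` is rejected, while `(a, (b)) => 1` is accepted; a group that stays open at the end of the line is rejected, which is the same single-line limitation discussed above.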

For binding patterns vs literals it's basically the same deal, except the lookahead wants to match {} or [] and see if it is followed by the vanilla assignment operator. In this case admittedly it's a little more likely to span lines, but I still think that's a weird enough case not to fret over. Plus, most of the time a destructured assignment will appear in a declaration, not an expression, in which case there's no need for a lookahead at all; for example the opening bracket in the following unambiguously begins an array binding pattern:

let 
0 Likes

#50

An update on where I'm at: I've built a new regex engine, replacing Oniguruma for the majority of the matching tasks within the syntax highlighter. Speed of the hybrid system is around twice that of the one in 3084, which will make large file loading and file indexing substantially more efficient. Aside from speed, there should be no user-visible changes, and the regex syntax is identical.

Plan is to tie off the loose ends and get a new build out next week, and then get back to updating the syntax definitions.

In hindsight, given the regex flavour did end up being identical, I would have put off this work until later, but I wasn't sure that was going to be the case until recently.

0 Likes