Sublime Forum

Syntax Fun

#41

@315234 (awesome name): Use the (?x) switch. It will ignore all spaces after it, except for those within character sets. geocities.jp/kosako3/oniguruma/doc/RE.txt
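
To illustrate, here is a small sketch in Python rather than Oniguruma (an assumption made so the example is runnable; Python's `re` module also accepts `(?x)` and likewise keeps whitespace inside character sets significant). The pattern itself is invented for the example.

```python
import re

# (?x) turns on extended mode: unescaped whitespace in the pattern is ignored,
# so the regex can be spaced out for readability.
# A space inside a character set, here [ ], is still matched literally.
spaced = re.compile(r"(?x) (\d+) [ ] ([a-z]+)")
compact = re.compile(r"(\d+)[ ]([a-z]+)")

for text in ("12 ab", "12ab"):
    # Both spellings accept and reject exactly the same inputs.
    assert bool(spaced.fullmatch(text)) == bool(compact.fullmatch(text))

print(spaced.fullmatch("12 ab").groups())
```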

0 Likes

#42

Allowing a captured subexpression in the scope would allow creating the following syntax for sublime-syntax files.

contexts:
  main:
    - match: ""
      push: Packages/YAML/YAML.sublime-syntax
      with_prototype:
        - match: ([a-z]+(?:\.[a-z]+)+)
          scope: \1

Basically it would assign to every scope expression (dot separated list of words) that precise expression as its scope.

Besides being a file that describes its own syntax (which is neat) and a nice example of a templating language, it would be useful when editing other syntax files, because you could see how your current theme colors the scope you are assigning right in the syntax file itself.

0 Likes

#43

By the way, LR parsers are more powerful while still using linear time. The current syntax is unable to parse the following:

(a,
b,
c) => 1

to identify the parameters (a, b, c) in JavaScript ES6, because regexes are restricted to single lines. LR parsing is also what flex/bison uses, and flex/bison is one of the most popular parser generators today.

0 Likes

#44

@jps,
In sublimetext.com/docs/3/syntax.html, in the “header” section, where it says “file_line_match” it should say “first_line_match”. Found out the hard way… :laughing:

0 Likes

#45

An update on this: I’m currently doing some work that I expect will give a significant speedup for syntax highlighting. This will primarily benefit loading large files and file indexing. I don’t expect this to have any compatibility issues wrt regex flavour. Early results are promising, but there’s more work to be done until it’s ready. Next build won’t be until next week at the earliest, and that may well turn out to be optimistic.

@FichteFoll thanks for looking into it, will fix

@farsouth, thanks, will fix

0 Likes

#46

@jps do you anticipate ever superseding all the TextMate formats (tmTheme, tmPreferences, …)? Not sure there’s a business case for doing so, but it would be cool to have a “pure Sublime” environment :stuck_out_tongue:

0 Likes

#47

Are there plans for a substitution syntax in regular expressions?
As I am crawling the YAML spec I find myself needing certain regex patterns like yaml.org/spec/1.2/spec.html#ns-uri-char frequently. It’s a pain to copy&paste these patterns all the time. It would be a lot easier if there was some syntax where I could write e.g. {{ns-uri-char}} inside a regular expression and have it substituted by the actual pattern which I defined elsewhere. Other syntax suggestions welcome.

Hopefully you add named capture groups. :slight_smile:

0 Likes

#48

[quote="simonzack"]By the way, LR parsers are more powerful while still using linear time. The current syntax is unable to parse the following:

(a,
b,
c) => 1

to identify the parameters (a, b, c) in JavaScript ES6, because regexes are restricted to single lines. LR parsing is also what flex/bison uses, and flex/bison is one of the most popular parser generators today.[/quote]

I created a syntax definition using the new file format which is able to detect argument lists across lines like this, for Fortran. It pushes a new scope when it sees the equivalent of “(a,” and only pops it when it sees the closing parenthesis. You could try something similar. My code is here:

github.com/315234/SublimeFortran
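
The push/pop idea described above can be sketched in a few lines of Python. This is only an illustration of the technique, not the Fortran package's code and not how Sublime implements contexts; the function name, the context names and the scope names are all made up for the example.

```python
import re

def scope_lines(lines):
    """Return, for each line, a list of (token, scope) pairs.

    The context stack outlives each line, which is what lets a construct
    opened on one line be closed on a later one.
    """
    stack = ["main"]  # hypothetical context names
    result = []
    for line in lines:
        scoped = []
        for token in re.findall(r"[A-Za-z_]\w*|[(),]", line):
            if token == "(":
                scoped.append((token, "punctuation.begin"))
                stack.append("arguments")  # push on the opening parenthesis
            elif token == ")" and stack[-1] == "arguments":
                stack.pop()  # pop only on the closing parenthesis
                scoped.append((token, "punctuation.end"))
            elif token == ",":
                scoped.append((token, "punctuation.separator"))
            elif stack[-1] == "arguments":
                scoped.append((token, "variable.parameter"))
            else:
                scoped.append((token, "identifier"))
        result.append(scoped)
    return result

scopes = scope_lines(["call(a,", "b,", "c)"])
for line_scopes in scopes:
    print(line_scopes)
```

Each regex still only ever sees one line; the state carried between lines lives entirely in the stack.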

0 Likes

#49

That’s the idea generally, but the reason this is an unusual case, for ES anyway, is that it’s a (fortunately rare) bit of syntax where you need to look at a later token, which is permitted to fall on another line, in order to disambiguate the initial construct. Until you hit either a rest operator or the fat arrow, the syntax of the parameter group could as easily be a parenthesized expression. Since arrow functions are themselves expressions, and can appear in exactly the same places parenthesized expressions may appear, there’s no outer context you can use to help.

The good news here is that in ES it’s honestly pretty weird to have a line break before the arrow, and this is likewise true for the other small handful of potentially ambiguous cases, like binding patterns (for destructured assignment) vs literal objects and arrays, so you can use an Akira Tanaka Special to match braces and get it right 99.9% of the time. Here’s what I used when I was attempting an ES6 sublime-syntax (since given up temporarily, because I kept running into problems with hanging Sublime that I couldn’t successfully debug):

  arrowFunctionWithParens:
    - match: '(?x) \( (?= (?<parens> [^\(\)] | \( \g<parens>* \) )* \)\s*=> )'
      scope: punctuation.definition.parameters.begin
      set: [ arrowFunction_AFTER_PARAMS, params ]
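
For readers who find the recursive lookahead hard to parse, here is a rough Python equivalent of what it checks. Python's `re` has no `\g<name>` recursion, so this sketch counts parentheses by hand; the function name is hypothetical, and like the regex it only ever looks at a single line.

```python
def opens_arrow_params(line, start):
    """Return True if the '(' at line[start] is closed on this line by a
    balanced ')' that is followed by optional whitespace and '=>'."""
    if start >= len(line) or line[start] != "(":
        return False
    depth = 0
    for i in range(start, len(line)):
        ch = line[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                # The balanced group ended; only whitespace may precede '=>'.
                return line[i + 1:].lstrip().startswith("=>")
    return False  # the group is still open at the end of the line

print(opens_arrow_params("(a, (b)) => 1", 0))
print(opens_arrow_params("(a + b) * 2", 0))
print(opens_arrow_params("(a,", 0))
```

The last call shows the limitation discussed above: when the arrow falls on a later line, the opening parenthesis is treated as a plain parenthesized expression.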

For binding patterns vs literals it’s basically the same deal, except the lookahead wants to match {} or [] and see if it is followed by the vanilla assignment operator. In this case admittedly it’s a little more likely to span lines, but I still think that’s a weird enough case not to fret over. Plus, most of the time a destructured assignment will appear in a declaration, not an expression, in which case there’s no need for a lookahead at all; for example the opening bracket in the following unambiguously begins an array binding pattern:

let 
0 Likes

#50

An update on where I’m at: I’ve built a new regex engine, replacing Oniguruma for the majority of the matching tasks within the syntax highlighter. Speed of the hybrid system is around twice that of the one in 3084, which will make large file loading and file indexing substantially more efficient. Aside from speed, there should be no user visible changes, and the regex syntax is identical.

Plan is to tie off the loose ends and get a new build out next week, and then get back to updating the syntax definitions.

In hindsight, given the regex flavour did end up being identical, I would have put off this work until later, but I wasn’t sure that was going to be the case until recently.

0 Likes

#51

Can you send me an email at jps@sublimetext.com with your .sublime-syntax file and a file that triggers a lockup? I’ll see what I can do to make sure it doesn’t happen.

0 Likes

#52

Speedup is always good. :wink:

Since the flavor will be the same as Oniguruma apparently, maybe you could provide a bit better documentation for \p{} Character Properties and POSIX brackets in character sets? geocities.jp/kosako3/oniguruma/doc/RE.txt is not too verbose there.

About other features for syntax definitions:

The main use case I am imagining would be templating languages (e.g. all the HTML stuff) where templating is not allowed everywhere but only in certain contexts (e.g. not in strings). Currently it appears that if there is a string context .
Another example would be special highlighting of JSON or YAML files where known keys (e.g. sublime-* files) receive special highlighting to give the user some direct feedback. Ideally you’d only need to import/push the base syntax and then inject a few patterns that would only match the key name but not the value.

(not entirely relevant here, but) Any thoughts on this?

[quote="FichteFoll"]Are there plans for a substitution syntax in regular expressions?
It would be a lot easier if there was some syntax where I could write e.g. {{ns-uri-char}} inside a regular expression and have it substituted by the actual pattern which I defined elsewhere. Other syntax suggestions welcome.[/quote]

This would be very appreciated. If you want to do syntax definitions right, such as considering the allowed characters for identifiers, you would need to include complex regular expressions like this:
[code]
[[\p{Letter} \p{Nl} _
\x{2118} \x{212E} \x{309B} \x{309C}] && [^\p{Me}]]
[[\p{Word} \p{Nl}
\x{2118} \x{212E} \x{309B} \x{309C}
\x{00B7} \x{0387} \x{1369}-\x{1371} \x{19DA}] && [^\p{Me}]]*
[/code]
That is just not feasible.

If you are considering this, there needs to be a way to use patterns in the definition of patterns as well.
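
As a sketch of what such nested substitution could look like, here is a small Python function that expands `{{name}}` placeholders, including placeholders inside other definitions. This is purely illustrative and says nothing about how Sublime Text would implement the feature; the function name and the variable names are invented for the example.

```python
import re

def expand(pattern, variables, _seen=()):
    """Replace every {{name}} in `pattern` with its definition, recursively."""
    def replace(match):
        name = match.group(1)
        if name in _seen:
            # A definition that (indirectly) refers to itself can never finish expanding.
            raise ValueError(f"circular variable reference: {name}")
        if name not in variables:
            raise KeyError(name)
        return expand(variables[name], variables, _seen + (name,))
    return re.sub(r"\{\{([A-Za-z0-9_-]+)\}\}", replace, pattern)

variables = {
    "ident_start": "[A-Za-z_]",
    "ident_part": "[A-Za-z_0-9]",
    # A pattern defined in terms of other patterns.
    "ident": "{{ident_start}}{{ident_part}}*",
}
regex = expand(r"\b{{ident}}\b", variables)
print(regex)
```

Tracking the names already being expanded is what keeps a self-referencing definition from recursing forever.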

0 Likes

#53

Thanks. I’ll try to put together the most minimal example that triggers it since the whole thing is pretty vast. Of course, I’m not at all certain that it isn’t something I’m just doing wrong. FWIW, any file will trigger the lockup, even an empty one.

0 Likes

#54

The general rule I use is, if it needs to access the window, it should be a WindowCommand. TextCommands are generally restricted to things that manipulate the text buffer in some way, such as sorting the lines.

Agreed. Even writing simple things, such as a valid C identifier: [a-zA-Z_][a-zA-Z_0-9]* gets tedious quite quickly. I’ll look into it, but something along the lines of {{var}} seems reasonable.

[quote="bathos"]

Thanks. I’ll try to put together the most minimal example that triggers it since the whole thing is pretty vast. Of course, I’m not at all certain that it isn’t something I’m just doing wrong. FWIW, any file will trigger the lockup, even an empty one.[/quote]

Thanks - minimal is nice, but don’t put too much effort into it. It’s a lot simpler on my end with a debugger in hand to faff about with that.

0 Likes

#55

+1 for substitution!

0 Likes

#56

Sorry for the delay in getting an example prepped (work came first) but it may not be needed anymore. Not sure what changes were made that would affect this, but I started experimenting again tonight and so far with the new build I’m not seeing any lock up! (fingers crossed)

The new variable system is great. I’m not sure how many languages define their identifier chars against Unicode properties, but this snippet for ES6 identifiers (not including invalidation of reserved words) may be useful to others:

variables:
  # Unicode character property shims (code points written as \x{...} escapes)
  ID_Continue: '{{ID_Start}}\p{Mn}\p{Mc}\p{Nd}\p{Pc}{{Other_ID_Continue}}'
  ID_Start: '\p{L}\p{Nl}{{Other_ID_Start}}'
  Other_ID_Continue: '\x{00B7}\x{0387}\x{1369}-\x{1371}\x{19DA}'
  Other_ID_Start: '\x{2118}\x{212E}\x{309B}\x{309C}'
  # Identifier
  identifier: '{{identifierStart}}{{identifierPart}}'
  identifierPart: '(?:[\$_\x{200C}\x{200D}{{ID_Continue}}]|{{unicodeEscape}})*' # note! ZWNJ (U+200C) & ZWJ (U+200D) after _
  identifierStart: '(?:[\$_{{ID_Start}}]|{{unicodeEscape}})'
  unicodeEscape: '\\u(?:\h{4}|\{\h+\})'

The literal ZWJ and ZWNJ characters will likely not C&P from a browser, so you’ll have to add them in manually (or write them as escapes) if your language permits them (they’re not as obscure as might be assumed by a Latin alphabet native – they’re critical for many scripts). The above definitions for ID_Start and ID_Continue are technically incomplete because the real defs subtract any overlaps from Pattern_Whitespace and Pattern_Syntax. Though possible using Oniguruma’s subtractive &&[^whatever], I didn’t bother because I think it’s already thorough enough: those two properties are vast and the overlaps are very obscure, while the important thing to me was mainly to ensure the matches work for all human languages.

0 Likes

#57

Regarding the syntax test files, I would like to see a way of testing multiple columns of a line in a single query.

So, instead of

[code]
- !foo ""
# ^ storage.type.tag-handle
#  ^ storage.type.tag-handle
#   ^ storage.type.tag-handle
#    ^ storage.type.tag-handle[/code]

I would be doing

[code]
- !foo ""
# ^^^^ storage.type.tag-handle[/code]

The first would obviously get infeasible for larger matches.

Another idea, that accompanies the above, is a way to also define columns where the scope selector should not match, e.g.

[code]
- !foo ""
# ^^^^x storage.type.tag-handle[/code]

(open for symbol suggestions)
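
To make the proposal concrete, here is a small Python sketch of how such an assertion line might be read. It is not Sublime's test runner: the multi-caret form and the `x` marker are only the suggestions from this post, and the function name and return shape are assumptions. An `x` counts as a marker only when it directly follows a caret, so a selector that happens to start with "x" is not misread.

```python
import re

# '#', then one or more runs of carets (each optionally followed by 'x'
# markers) separated by whitespace, then the scope selector.
ASSERTION = re.compile(r"#((?:\s*\^+x*)+)\s+(\S.*)$")

def parse_assertion(line):
    """Return (columns that must match, columns that must not, selector)."""
    m = ASSERTION.match(line)
    if m is None:
        raise ValueError(f"not an assertion line: {line!r}")
    offset = m.start(1)  # column of the first character after '#'
    must_match, must_not_match = [], []
    for i, ch in enumerate(m.group(1)):
        if ch == "^":
            must_match.append(offset + i)
        elif ch == "x":
            must_not_match.append(offset + i)
    return must_match, must_not_match, m.group(2).strip()

print(parse_assertion("# ^^^^x storage.type.tag-handle"))
print(parse_assertion("# ^  ^ storage.type.tag-handle"))
```

Columns are 0-based offsets into the line under test, so for `- !foo ""` the columns 2 to 5 cover `!foo`. Allowing whitespace between caret runs also covers testing several discontinuous positions in one line.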

0 Likes

#58

with regards to the negative match, you can use the not operator in the selector:

- !foo ""
#  ^ - storage.type.tag-handle
0 Likes

#59

I know that, but by adding additional syntax you could do it in one line. It’s fairly common to specify a scope matching only for a certain region in a line and not matching anywhere else in the line – or at least in most of the remaining line. That’s why it would be accompanying the other change for allowing “more columns to be matched in a single line”.

0 Likes

#60

I’ll add that the ideal version supports testing any number of discontinuous positions in the line, and not just a range.

0 Likes