Sublime Forum

Syntax Fun

#45

An update on this: I’m currently doing some work that I expect will give a significant speedup for syntax highlighting. This will primarily benefit loading large files and file indexing. I don’t expect this to have any compatibility issues wrt regex flavour. Early results are promising, but there’s more work to be done until it’s ready. Next build won’t be until next week at the earliest, and that may well turn out to be optimistic.

@FichteFoll thanks for looking into it, will fix

@farsouth, thanks, will fix

#46

@jps do you anticipate ever superseding all the TextMate formats (tmTheme, tmPreferences, …)? Not sure there’s a business case for doing so, but it would be cool to have a “pure Sublime” environment :stuck_out_tongue:

#47

Are there plans for a substitution syntax in regular expressions?
As I am crawling the YAML spec I find myself needing certain regex patterns like yaml.org/spec/1.2/spec.html#ns-uri-char frequently. It’s a pain to copy&paste these patterns all the time. It would be a lot easier if there was some syntax where I could write e.g. {{ns-uri-char}} inside a regular expression and have it substituted by the actual pattern which I defined elsewhere. Other syntax suggestions welcome.

Hopefully you add named capture groups. :slight_smile:
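To make the request concrete, here is a rough Python sketch of how `{{name}}` placeholders could be expanded before a pattern reaches the regex engine. Nothing here is Sublime's actual implementation: `expand`, `PLACEHOLDER`, and the two variable names are made up for illustration, and the patterns are simplified stand-ins rather than the real YAML spec productions.

```python
import re

# Hypothetical placeholder syntax: {{name}}, where name may contain letters,
# digits, underscores and hyphens (assumption, chosen to fit names like ns-uri-char).
PLACEHOLDER = re.compile(r"\{\{([A-Za-z0-9_-]+)\}\}")


def expand(pattern, variables, _seen=()):
    """Recursively replace every {{name}} in pattern with variables[name]."""
    def repl(match):
        name = match.group(1)
        if name in _seen:
            raise ValueError("circular reference: " + name)
        if name not in variables:
            raise KeyError(name)
        # Variables may themselves use placeholders, so expand them too.
        return expand(variables[name], variables, _seen + (name,))
    return PLACEHOLDER.sub(repl, pattern)


# Illustrative definitions only, not copied from the YAML spec.
variables = {
    "hex-digit": r"[0-9A-Fa-f]",
    "pct-encoded": r"%{{hex-digit}}{{hex-digit}}",
}

expanded = expand(r"(?:{{pct-encoded}})+", variables)
print(expanded)
```

Because variables can refer to other variables, a frequently used pattern only has to be written once, which is the whole point of the request.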

#48

[quote=“simonzack”]By the way, LR parsers are more powerful while still using linear time. The current syntax is unable to parse the following:

(a,
b,
c) => 1

to identify parameters (a, b, c) in JavaScript ES6, since regexes are restricted to single lines. This is also what flex/bison uses, and flex/bison is one of the most popular parsers today.[/quote]

I created a syntax definition for Fortran using the new file format that can detect argument lists spanning lines like this. It pushes a new context when it sees the equivalent of “(a,” and only pops it when it sees the closing parenthesis. You could try something similar. My code is here:

github.com/315234/SublimeFortran
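The push/pop idea can be sketched outside Sublime in a few lines of Python. This is only an illustration of the technique, not the code from the repository above: `scope_lines` and the labels `source` and `params` are invented names. The point is that the stack lives outside the per-line loop, so an open parenthesis on one line still applies on the next.

```python
def scope_lines(lines):
    """Label each character as 'params' or 'source' using a context stack.

    The stack persists from one line to the next, which is what lets a
    parenthesised list span several lines even though each line is
    processed on its own.
    """
    stack = ["source"]  # bottom context, never popped
    result = []
    for line in lines:
        labels = []
        for ch in line:
            if ch == "(":
                stack.append("params")  # push on the opening parenthesis
                labels.append("params")
            elif ch == ")" and len(stack) > 1:
                labels.append("params")  # the closer still belongs to the list
                stack.pop()              # pop only on the closing parenthesis
            else:
                labels.append(stack[-1])
        result.append(labels)
    return result


labels = scope_lines(["f(a,", "b,", "c) = 1"])
for line_labels in labels:
    print(line_labels)
```

Note that this does not solve the ES6 ambiguity by itself: it knows it is inside parentheses, but not whether those parentheses turn out to be a parameter list.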

#49

That’s the idea generally, but the reason this is an unusual case, for ES anyway, is that it’s a (fortunately rare) bit of syntax where you need to look at a later token, which is permitted to fall on another line, in order to disambiguate the initial construct. Until you hit either a rest operator or the fat arrow, the syntax of the parameter group could as easily be a parenthesized expression. Since arrow functions are themselves expressions, and can appear in exactly the same places parenthesized expressions may appear, there’s no outer context you can use to help.

The good news here is that in ES it’s honestly pretty weird to have a line break before the arrow, and this is likewise true for the other small handful of potentially ambiguous cases, like binding patterns (for destructured assignment) vs literal objects and arrays, so you can use an Akira Tanaka Special to match braces and get it right 99.9% of the time. Here’s what I used when I was attempting an ES6 sublime-syntax (since given up temporarily, because I kept running into problems with hanging Sublime that I couldn’t successfully debug):

  arrowFunctionWithParens:
    - match: '(?x) \( (?= (?<parens> [^\(\)] | \( \g<parens>* \) )* \)\s*=> )'
      scope: punctuation.definition.parameters.begin
      set: [ arrowFunction_AFTER_PARAMS, params ]
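For anyone who finds the recursive lookahead hard to read, here is roughly the same check written as a plain Python scan (Python's built-in `re` module has no `\g<name>` subexpression calls, so it cannot run the pattern directly). `starts_arrow_params` is a hypothetical helper; like the regex, it only looks at a single line and ignores parentheses inside strings or comments.

```python
def starts_arrow_params(text, pos):
    """Return True if text[pos] is '(' and its balanced group is followed by '=>'.

    Mirrors the lookahead: consume non-parenthesis characters or nested
    balanced groups, then require the matching ')' followed by optional
    whitespace and the fat arrow.
    """
    if pos >= len(text) or text[pos] != "(":
        return False
    depth = 0
    for i in range(pos, len(text)):
        ch = text[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                rest = text[i + 1:].lstrip()  # plays the role of \s*
                return rest.startswith("=>")
    # Never balanced on this line, e.g. the arrow sits on a later line.
    return False


print(starts_arrow_params("(a, b) => a + b", 0))
print(starts_arrow_params("(a + b) * 2", 0))
```

The last `return False` is exactly the multi-line case discussed above: if the closing parenthesis or the arrow is on another line, the single-line lookahead cannot see it.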

For binding patterns vs literals it’s basically the same deal, except the lookahead wants to match {} or [] and see if it is followed by the vanilla assignment operator. In this case admittedly it’s a little more likely to span lines, but I still think that’s a weird enough case not to fret over. Plus, most of the time a destructured assignment will appear in a declaration, not an expression, in which case there’s no need for a lookahead at all; for example the opening bracket in the following unambiguously begins an array binding pattern:

let 

#50

An update on where I’m at: I’ve built a new regex engine, replacing Oniguruma for the majority of the matching tasks within the syntax highlighter. Speed of the hybrid system is around twice that of the one in 3084, which will make large file loading and file indexing substantially more efficient. Aside from speed, there should be no user visible changes, and the regex syntax is identical.

Plan is to tie off the loose ends and get a new build out next week, and then get back to updating the syntax definitions.

In hindsight, given the regex flavour did end up being identical, I would have put off this work until later, but I wasn’t sure that was going to be the case until recently.

#51

Can you send me an email at jps@sublimetext.com with your .sublime-syntax file and a file that triggers a lockup? I’ll see what I can do to make sure it doesn’t happen.

#52

Speedup is always good. :wink:

Since the flavor will be the same as Oniguruma apparently, maybe you could provide a bit better documentation for \p{} Character Properties and POSIX brackets in character sets? geocities.jp/kosako3/oniguruma/doc/RE.txt is not too verbose there.

About other features for syntax definitions:

The main use case I am imagining would be templating languages (e.g. all the HTML stuff) where templating is not allowed everywhere but only in certain contexts (e.g. not in strings). Currently it appears that if there is a string context .
Another example would be special highlighting of JSON or YAML files where known keys (e.g. sublime-* files) receive special highlighting to give the user some direct feedback. Ideally you’d only need to import/push the base syntax and then inject a few patterns that would only match the key name but not the value.

(not entirely relevant here, but) Any thoughts on this?

[quote=“FichteFoll”]Are there plans for a substitution syntax in regular expressions?
It would be a lot easier if there was some syntax where I could write e.g. {{ns-uri-char}} inside a regular expression and have it substituted by the actual pattern which I defined elsewhere. Other syntax suggestions welcome.[/quote]

This would be very appreciated. If you want to do syntax definitions right, such as considering the allowed characters for identifiers, you would need to include complex regular expressions like
[code][[\p{Letter} \p{Nl} _
\x{2118} \x{212E} \x{309B} \x{309C}] && [^\p{Me}]]
[[\p{Word} \p{Nl}
\x{2118} \x{212E} \x{309B} \x{309C}
\x{00B7} \x{0387} \x{1369}-\x{1371} \x{19DA}] && [^\p{Me}]]*
[/code]
That is just not feasible.

If you are considering this, there needs to be a way to use patterns in the definition of patterns as well.

#53

Thanks. I’ll try to put together the most minimal example that triggers it since the whole thing is pretty vast. Of course, I’m not at all certain that it isn’t something I’m just doing wrong. FWIW, any file will trigger the lockup, even an empty one.

#54

The general rule I use is, if it needs to access the window, it should be a WindowCommand. TextCommands are generally restricted to things that manipulate the text buffer in some way, such as sorting the lines.

Agreed. Even writing simple things, such as a valid C identifier: [a-zA-Z_][a-zA-Z_0-9]* gets tedious quite quickly. I’ll look into it, but something along the lines of {{var}} seems reasonable.

[quote=“bathos”]

Thanks. I’ll try to put together the most minimal example that triggers it since the whole thing is pretty vast. Of course, I’m not at all certain that it isn’t something I’m just doing wrong. FWIW, any file will trigger the lockup, even an empty one.[/quote]

Thanks - minimal is nice, but don’t put too much effort into it. It’s a lot simpler on my end with a debugger in hand to faff about with that.

#55

+1 for substitution!

#56

Sorry for the delay in getting an example prepped (work came first) but it may not be needed anymore. Not sure what changes were made that would affect this, but I started experimenting again tonight and so far with the new build I’m not seeing any lock up! (fingers crossed)

The new variable system is great. I’m not sure how many languages define their identifier chars against Unicode properties, but this snippet for ES6 identifiers (not including invalidation of reserved words) may be useful to others:

variables:
  # Unicode character property shims
  ID_Continue: '{{ID_Start}}\p{Mn}\p{Mc}\p{Nd}\p{Pc}{{Other_ID_Continue}}'
  ID_Start: '\p{L}\p{Nl}{{Other_ID_Start}}'
  Other_ID_Continue: '··፩-፱᧚'
  Other_ID_Start: '℘℮゛゜'
  # Identifier
  identifier: '{{identifierStart}}{{identifierPart}}'
  identifierPart: '(?:[\$_‍‌{{ID_Continue}}]|{{unicodeEscape}})*' # note! ZWJ & ZWNJ after _
  identifierStart: '(?:[\$_{{ID_Start}}]|{{unicodeEscape}})'
  unicodeEscape: '\\u(?:\h{4}|\{\h+\})'

The ZWJ and ZWNJ will likely not C&P from a browser so you’ll have to add them in manually if your language permits them (they’re not as obscure as might be assumed by a Latin alphabet native; they’re critical for many scripts). The above definitions for ID_Start and ID_Continue are technically incomplete because the real defs subtract any overlaps from Pattern_White_Space and Pattern_Syntax. Though possible using Oniguruma’s subtractive &&[^whatever], I didn’t bother because I think it’s already thorough enough: those two properties are vast and the overlaps are very obscure, while the important thing to me was mainly to ensure the matches work for all human languages.
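If you want to sanity-check which characters the shims accept without opening Sublime, the same approximation can be written in Python with `unicodedata`. This is a sketch under the same simplifications as the variables above (no Pattern_White_Space / Pattern_Syntax subtraction, no `\u` escapes); the function names are mine, and results for rare characters depend on the Unicode version bundled with your Python.

```python
import unicodedata

# Extra code points, as listed in the variables above.
OTHER_ID_START = set("\u2118\u212E\u309B\u309C")
OTHER_ID_CONTINUE = set("\u00B7\u0387\u19DA") | {chr(c) for c in range(0x1369, 0x1372)}
JOINERS = {"\u200C", "\u200D"}  # ZWNJ and ZWJ


def is_id_start(ch):
    """Approximate ID_Start: letters (L*), letter numbers (Nl), $ and _."""
    cat = unicodedata.category(ch)
    return ch in "$_" or cat.startswith("L") or cat == "Nl" or ch in OTHER_ID_START


def is_id_part(ch):
    """Approximate ID_Continue: ID_Start plus Mn, Mc, Nd, Pc and the joiners."""
    cat = unicodedata.category(ch)
    return (
        is_id_start(ch)
        or cat in ("Mn", "Mc", "Nd", "Pc")
        or ch in OTHER_ID_CONTINUE
        or ch in JOINERS
    )


def is_identifier(name):
    """True if name is non-empty, starts with a start char, and continues validly."""
    return bool(name) and is_id_start(name[0]) and all(is_id_part(c) for c in name[1:])


for candidate in ["$foo_1", "naïve", "1abc", "a-b"]:
    print(candidate, is_identifier(candidate))
```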

#57

Regarding the syntax test files, I would like to see a way of testing multiple columns of a line in a single query.

So, instead of

[code]
- !foo ""
# ^ storage.type.tag-handle
#  ^ storage.type.tag-handle
#   ^ storage.type.tag-handle
#    ^ storage.type.tag-handle[/code]

I would be doing

[code]
- !foo ""
# ^^^^ storage.type.tag-handle[/code]

The first would obviously get infeasible for larger matches.

Another idea, that accompanies the above, is a way to also define columns where the scope selector should not match, e.g.

[code]
- !foo ""
# ^^^^x storage.type.tag-handle[/code]

(open for symbol suggestions)

#58

With regard to the negative match, you can use the not operator in the selector:

- !foo ""
#  ^ - storage.type.tag-handle

#59

I know that, but by adding additional syntax you could do it in one line. It’s fairly common to specify a scope matching only for a certain region in a line and not matching anywhere else in the line – or at least in most of the remaining line. That’s why it would be accompanying the other change for allowing “more columns to be matched in a single line”.

#60

I’ll add that the ideal version supports testing any number of discontinuous positions in the line, and not just a range.

#61

Here is an example of how I imagine this to work:

[code]- plain - plain ? plain : plain: plain

<- -string

^ -string

^ -string

^ -string

^ -string

^ -string

^ -string

^ -string

^ -string

- plain - plain ? plain : plain: plain
#.^^^^^...^^^^^...^^^^^...^^^^^...^^^^^ string.unquoted.plain
# <- -string.unquoted.plain

[/code]

^ represents a supposed match, . represents a supposed “not match” and (space) represents “don’t care”
I would use -, but that is already used for negative selectors and thus ambiguous.

Edit: This example is actually wrong, but you get the idea.
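To show that the notation is easy to process, here is a small Python sketch that turns one such assertion line into column lists. It is purely a mock-up of the proposal, not how Sublime parses test files: `parse_assertion` is a made-up name, and it assumes the marker field ends at the first character that is not `^`, `.` or a space, so a selector could not begin with one of those characters.

```python
def parse_assertion(line, comment="#"):
    """Split a proposed multi-column assertion line into its parts.

    Returns (must_match, must_not_match, selector), where the first two
    are lists of zero-based columns in the line under test.
    """
    if not line.startswith(comment):
        raise ValueError("not a comment line: " + line)
    start = len(comment)
    end = start
    # Assumption: markers run until the first char that is not '^', '.' or ' '.
    while end < len(line) and line[end] in "^. ":
        end += 1
    must_match = [col for col in range(start, end) if line[col] == "^"]
    must_not_match = [col for col in range(start, end) if line[col] == "."]
    return must_match, must_not_match, line[end:].strip()


match_cols, no_match_cols, selector = parse_assertion("# ^^^^. storage.type.tag-handle")
print(match_cols, no_match_cols, selector)
```

Spaces inside the marker field simply produce no entry in either list, which gives the “don’t care” behaviour, and discontinuous positions work without extra syntax.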

#62

I hadn’t really considered rolling in a NOT character with it, but that’s a nice touch. It would make for much more expressive test blocks when we can express a larger logical idea with a single line. +1

#63

Another thing I ran into with block scalars in YAML: There are situations where you can not test a certain line because a comment is not allowed in the following line. A way to test against line " - " would be nice, where I could provide “” in the comment line somehow.

Other example would be multi-line strings with a newline escape (backslash before newline). Obviously the next line has to be a string as well and can’t be a comment matching against the escaped newline in the first line.

#64

@jps is there anything planned for improving syntax tests to be able to test multiple positions in one line, as I proposed earlier?

I have written a couple syntax test files for YAML (and intend to write more tests for future syntaxes) and this change would reduce the noise from comment lines greatly.
It would also promote writing more precise tests just because it gets easier. I resorted to only testing against meaningful positions in a line because of that.
