Sublime Forum

[sublime-syntax] Allow \1 in patterns that don't pop

#1

Right now, in .sublime-syntax files, if a match pattern in a context A pushes a context B, and a match pattern in B pops, then the popping pattern can use \1 to reference the contents of the first capture group of the match that pushed B.

This is perfect for managing certain constructs, but it’s very limited in usage. If it were extended so that any match pattern could use \1 to reference the first capture group in the match that pushed the current context, that’d greatly increase the capabilities of the system, e.g. by allowing us to define Python-style grammars, where a context is pushed for each new block and is popped only when the indentation decreases (currently, the Python syntax can’t, for example, put a meta.class scope around a whole class definition).

The only way to do this now is to have many copies of the same base context that check for different levels of indentation, so e.g. instead of three primary contexts (main, class, function) I need to have potentially infinite ones (main, class, function, main_1, class_1, function_1, … main_n, class_n, function_n), where 1…n is the range of allowed indentation levels. This needless repetition (and its associated impact on the memory footprint of the grammar, as well as the associated costs in terms of maintainability) would be removed if the suggestion above were implemented.

Here is how a toy indent-based grammar would look if the suggestion were implemented:

  main_0:
    - meta_content_scope: main_0
    - match: '^(?!\t{0})'
      pop: true
    - match: '^(\t{0})(\s*)(?={{prefix}})'
      scope: invalid.illegal.gss.main_0.invalid-leading-whitespace
      push: [main_recursive, parse_block_opener]
    - match: '^(\t{0})(\s*)'
      scope: invalid.illegal.gss.main_0.invalid-leading-whitespace
      push: parse_normal_line
    - match: '(.+)'
      captures:
        2: invalid.illegal.gss.main_0
  main_recursive:
    - meta_content_scope: main_recursive
    - match: '^(?!\t\1)' # this is currently valid
      pop: true
    - match: '^(\t\1)(\s*)(?={{prefix}})' # this is currently invalid since the action in the pattern is not 'pop'
      scope: invalid.illegal.gss.main_recursive.invalid-leading-whitespace
      push: [main_recursive, parse_block_opener]
    - match: '^(\t\1)(\s*)'
      scope: invalid.illegal.gss.main_recursive.invalid-leading-whitespace
      push: parse_normal_line
    - match: '(.+)'
      captures:
        2: invalid.illegal.gss.main_recursive

And this is what I currently need to do to emulate it:

  main_0:
    - meta_content_scope: main_0
    - match: '^(?!\t{0})'
      pop: true
    - match: '^(\t{0})(\s*)(?={{prefix}})'
      scope: invalid.illegal.gss.main_0.invalid-leading-whitespace
      push: [main_1, parse_block_opener]
    - match: '^(\t{0})(\s*)'
      scope: invalid.illegal.gss.main_0.invalid-leading-whitespace
      push: parse_normal_line
    - match: '(.+)'
      captures:
        2: invalid.illegal.gss.main_0
  main_1:
    - meta_content_scope: main_1
    - match: '^(?!\t{1})'
      pop: true
    - match: '^(\t{1})(\s*)(?={{opener_line}})'
      captures:
        2: invalid.illegal.gss.main_0.invalid-leading-whitespace
      push: [main_2, parse_block_opener]
    - match: '^(\t{1})(\s*)'
      captures:
        2: invalid.illegal.gss.main_0.invalid-leading-whitespace
      push: parse_normal_line
    - match: '^(\t*)(.+)'
      captures:
        2: invalid.illegal.gss.main_1
  main_2:
    - meta_content_scope: main_2
    - match: '^(?!\t{2})'
      pop: true
    - match: '^(\t{2})(\s*)(?={{opener_line}})'
      captures:
        2: invalid.illegal.gss.main_0.invalid-leading-whitespace
      push: [main_3, parse_block_opener]
    - match: '^(\t{2})(\s*)'
      captures:
        2: invalid.illegal.gss.main_0.invalid-leading-whitespace
      push: parse_normal_line
    - match: '^(\t*)(.+)'
      captures:
        2: invalid.illegal.gss.main_2
  main_3:
    - meta_content_scope: main_3
    - match: '^(?!\t{3})'
      pop: true
    - match: '^(\t{3})(\s*)(?={{opener_line}})'
      captures:
        2: invalid.illegal.gss.main_0.invalid-leading-whitespace
      push: [main_4, parse_block_opener]
    - match: '^(\t{3})(\s*)'
      captures:
        2: invalid.illegal.gss.main_0.invalid-leading-whitespace
      push: parse_normal_line
    - match: '^(\t*)(.+)'
      captures:
        2: invalid.illegal.gss.main_3
  main_4:
    - meta_content_scope: main_4
    - match: '^(?!\t{4})'
      pop: true
    - match: '^(\t{4})(\s*)(?={{opener_line}})'
      captures:
        2: invalid.illegal.gss.main_0.invalid-leading-whitespace
      push: [main_5, parse_block_opener]
    - match: '^(\t{4})(\s*)'
      captures:
        2: invalid.illegal.gss.main_0.invalid-leading-whitespace
      push: parse_normal_line
    - match: '^(\t*)(.+)'
      captures:
        2: invalid.illegal.gss.main_4
...
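
Until something like the suggestion exists, one way to keep this workaround maintainable might be to generate the numbered contexts from a template instead of writing them by hand. The following is only a sketch in Python: `render_contexts` and the `@LEVEL@`/`@NEXT@` placeholders are hypothetical names, the template follows the main_1 … main_4 pattern above (main_0 differs slightly), and each scope uses the level's own name.

```python
# Hypothetical generator for the numbered contexts shown above.
# @LEVEL@ and @NEXT@ are made-up placeholders, not sublime-syntax features.
CONTEXT_TEMPLATE = r"""  main_@LEVEL@:
    - meta_content_scope: main_@LEVEL@
    - match: '^(?!\t{@LEVEL@})'
      pop: true
    - match: '^(\t{@LEVEL@})(\s*)(?={{opener_line}})'
      captures:
        2: invalid.illegal.gss.main_@LEVEL@.invalid-leading-whitespace
      push: [main_@NEXT@, parse_block_opener]
    - match: '^(\t{@LEVEL@})(\s*)'
      captures:
        2: invalid.illegal.gss.main_@LEVEL@.invalid-leading-whitespace
      push: parse_normal_line
    - match: '^(\t*)(.+)'
      captures:
        2: invalid.illegal.gss.main_@LEVEL@
"""


def render_contexts(max_level):
    """Return YAML text for the contexts main_1 .. main_<max_level>."""
    parts = []
    for level in range(1, max_level + 1):
        text = CONTEXT_TEMPLATE.replace("@NEXT@", str(level + 1))
        text = text.replace("@LEVEL@", str(level))
        parts.append(text)
    return "".join(parts)


if __name__ == "__main__":
    # Paste the output under `contexts:` in the .sublime-syntax file.
    print(render_contexts(4))
```

Note that the deepest generated context still pushes `main_<max_level + 1>`, which has to be defined separately, so this only hides the repetition; it does not remove the fixed limit on nesting depth.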

#2

I agree with the use case, but I think that the right way to go would be to replace the legacy (TextMate-derived) backreference-based system with a new one that’s more sensible. I don’t have a formal proposal, but I’ve talked about the possibilities.

Here’s an example of how a Python implementation might work:

contexts:
  main:
    - match: ''
      push: block
      with_parameters:
        indent: ''

  block:
    - meta_scope: meta.block.python

    - match: ^\s*$ # do nothing, empty line
    
    - match: ^({{$indent}}\s+) # increase indent
      push: block
      with_parameters:
        indent: \1

    - match: ^{{$indent}}(?=\S) # do nothing, same indent

    - match: ^ # decrease indent
      pop: true

Any context that used parameters would be dynamically recompiled using the slower Oniguruma engine (as is the case anyway when using backreferences). Other than that, the only overhead would be an additional parameter stack. In the typical case (no changed parameters), this should be negligible.

From an implementation standpoint, the parameter names could be translated to indices. Then, a parameter record would be an array of values. When a with_parameters rule is encountered, we can duplicate the record on top of the parameter stack and change the specified parameters. When the pushed/set context is popped, we pop the parameter stack. (We might need to add a flag to each record on the context stack indicating whether popping that context should pop the parameter stack.)
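
To make that a bit more concrete, here is a rough Python sketch of the bookkeeping. It is not how Sublime works internally; `ParameterStack` and `ContextStack` are hypothetical names, and starting every parameter as an empty string is an assumption.

```python
class ParameterStack:
    """Stack of parameter records; each record is a plain list of values."""

    def __init__(self, names):
        # Translate parameter names to indices once, up front.
        self.index = {name: i for i, name in enumerate(names)}
        # Assumption: every parameter starts out as the empty string.
        self.records = [[""] * len(names)]

    def current(self, name):
        return self.records[-1][self.index[name]]

    def push_with(self, changes):
        # Duplicate the top record, then overwrite the specified parameters.
        record = list(self.records[-1])
        for name, value in changes.items():
            record[self.index[name]] = value
        self.records.append(record)

    def pop(self):
        self.records.pop()


class ContextStack:
    """Context stack whose frames remember whether they own a record."""

    def __init__(self, params):
        self.params = params
        self.frames = []  # (context_name, pops_parameters) pairs

    def push(self, context, with_parameters=None):
        pops_parameters = with_parameters is not None
        if pops_parameters:
            self.params.push_with(with_parameters)
        self.frames.append((context, pops_parameters))

    def pop(self):
        context, pops_parameters = self.frames.pop()
        if pops_parameters:
            self.params.pop()
        return context
```

A push without with_parameters leaves the parameter stack alone, which is the cheap typical case; only frames that changed parameters pop a record when they are popped.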

The {{$indent}} syntax is designed to resemble the existing variable system without colliding with it or with any other valid regexp. The dollar sign can be statically detected so that the syntax compiler won’t try to replace it with a syntax-wide variable.
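
A small Python sketch of that separation, assuming variable names consist of letters, digits and underscores. `expand_variables` and `bind_parameters` are hypothetical helpers: the first stands in for the existing compile-time step, the second for the proposed runtime substitution.

```python
import re

# Existing {{name}} form; "$" is not a name character, so {{$name}} never matches.
VARIABLE = re.compile(r"\{\{([A-Za-z0-9_]+)\}\}")
# Proposed {{$name}} form, resolved later against the parameter record.
PARAMETER = re.compile(r"\{\{\$([A-Za-z0-9_]+)\}\}")


def expand_variables(pattern, variables):
    """Compile-time step: substitute syntax-wide variables only."""
    return VARIABLE.sub(lambda m: variables[m.group(1)], pattern)


def bind_parameters(pattern, parameters):
    """Runtime step: substitute parameter values as literal text."""
    return PARAMETER.sub(lambda m: re.escape(parameters[m.group(1)]), pattern)


if __name__ == "__main__":
    static = expand_variables(r"^{{$indent}}(?={{prefix}})", {"prefix": r"\w+:"})
    print(static)  # the {{$indent}} placeholder is still there
    bound = bind_parameters(static, {"indent": "\t\t"})
    print(re.match(bound, "\t\tfoo: bar") is not None)
```

Escaping the value when it is bound means a captured indent or tag name is always matched literally, never interpreted as a regexp.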

In addition to enabling new features like indentation detection in Python and YAML, I think that this system would be easier to use and reason about than the old backreference behavior even in cases where the old behavior works, such as tag matching. Example:

contexts:
  main:
    - match: <(\w+)>
      scope: open-tag.xml
      push:
        - meta_scope: meta.tag.xml
        - include: tag-body
      with_parameters:
        tag_name: \1

  tag-body:
    - match: </({{$tag_name}})>
      scope: close-tag.xml
      pop: true
    - match: </(\w+)>
      scope: invalid.illegal.unmatched
    - include: main

#3

I really like your idea. May I suggest a syntactic tweak to it? Since you propose using named parameters, and named capture groups are currently not wholly supported by Sublime’s engine, I’d suggest using the classical regex syntax for named groups to signal that you’re saving a parameter, and then a backreference to it to use it. So your last example would look like this:

contexts:
  main:
    - match: <(?'tag_name'\w+)>
      scope: open-tag.xml
      push:
        - meta_scope: meta.tag.xml
        - include: tag-body

  tag-body:
    - match: </(\k'tag_name')>
      scope: close-tag.xml
      pop: true
    - match: </(\w+)>
      scope: invalid.illegal.unmatched
    - include: main

#4

One thing I don’t like about the old backreference system is that it uses standard regexp syntax to represent special sublime-syntax-specific functionality. This can be confusing, and it could cause unexpected behavior. In addition, it’s not easy to parse correctly.

I think that using variable-like expressions makes more sense. The double-brace syntax does not resemble a standard regexp feature, and syntax developers expect text inside double braces to refer to a named variable outside the expression. In addition, it leaves open the possibility of Sublime eventually implementing named capture groups.

In addition to these concerns, which apply to the simple XML case, an explicit with_parameters key is more flexible. In the Python example, we use an explicit with_parameters: { indent: '' }. It’s not clear to me how this would work with capture groups. (Also, hardcoded with_parameters rules could be pre-compiled to avoid the slow regexp engine.)


#5

AFAIK, the current system does not use backtracking for back references in pop patterns but instead recompiles the pop pattern on the fly.
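
Roughly what I mean, as a Python illustration only (this is not Sublime's actual code, and `specialize_pop_pattern` is a made-up name): the text captured by the pushing match is escaped, pasted into the pop pattern in place of \1, and the result is compiled as an ordinary pattern without backreferences.

```python
import re

# Simplification: any backslash followed by a single digit counts as a
# reference to the pushing match, so an escaped backslash before a digit
# would be misread.
BACKREF = re.compile(r"\\(\d)")


def specialize_pop_pattern(pop_pattern, push_match):
    """Replace \\N in pop_pattern with the literal text of group N."""
    def substitute(m):
        return re.escape(push_match.group(int(m.group(1))))
    return re.compile(BACKREF.sub(substitute, pop_pattern))


if __name__ == "__main__":
    # Heredoc-style example: the terminator is whatever followed "<<".
    push_match = re.search(r"<<(\w+)", "cat <<EOF")
    pop = specialize_pop_pattern(r"^\1$", push_match)
    print(pop.pattern)
```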


#6

Interesting. Do you remember the source?


#7

I’m pretty sure Jon said this on the forum somewhere, but I doubt I’ll be able to find it again (considering how weird the forum search is). It might have been before the new Syntax format even.


#8

Not sure if you were referring to this part. Anyway, the whole reply contains more info:


#9

Is there any news on this one?

I’m facing a similar use case, where I want to implement a syntax that has Python-style indentation rules. The with_parameters proposal above would come in very handy for such grammars: to me, it looks both powerful and easy to understand (as opposed to the current \1 pop-matching mechanism, which I find quite difficult to wrap my head around).
