Sublime Forum

.sublime-syntax "pop backreference" bugs and suggestions

#1

I messed around with backreferences in the “end” pattern of tmLanguage syntaxes a lot, and I’m really happy this feature is still in sublime-syntax. That being said, I really hope some fixes can be made to them. If you don’t know what I’m talking about, the basic example of these special backreferences is this:

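contexts:
  main:
    # Capture any single character and push a new context.
    - match: (.)
      push: repeat_character
  repeat_character:
    # \1 is replaced with the text captured by the push match above.
    - match: \1
      pop: true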

This is not a normal backreference. It actually has some pretty weird properties, because it’s implemented by recompiling the pop regex each time, and using a string replacement on instances of \1.

This is a good thing, though. I actually enjoy the cool things you can do with these hacky backreferences, like matching nothing when the capture was an empty string, and matching some other characters when it’s non-empty.
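
To make the difference from a real backreference concrete, here is a minimal Python sketch of the behaviour described above. It is only a model: build_pop_regex is a hypothetical helper, Python's re module stands in for Sublime's regex engine, and re.escape stands in for whatever escaping Sublime actually applies.

```python
import re

def build_pop_regex(pop_pattern, push_match):
    """Hypothetical model: paste the push captures into the pop pattern as text."""
    def substitute(ref):
        index = int(ref.group(1))
        if index > push_match.re.groups:
            return ""  # no such capture group: substitute nothing
        # A group that did not participate yields None; treat it as empty.
        return re.escape(push_match.group(index) or "")
    # Simplification: every \1..\9 is replaced, even after an escaped backslash.
    return re.compile(re.sub(r"\\([1-9])", substitute, pop_pattern))

# The push regex captured "ab"; the pop pattern is \1+
push = re.match(r"(ab)", "abab")
pop = build_pop_regex(r"\1+", push)
print(pop.pattern)                   # the substituted source: ab+
print(bool(pop.fullmatch("abbb")))   # True: the + binds to "b" only
print(bool(pop.fullmatch("abab")))   # False: a repeated "ab" does not match

# An empty capture simply vanishes from the pop pattern.
empty = re.match(r"(x?)", "y")
print(build_pop_regex(r"\1end", empty).pattern)  # end
```

With a real backreference, (ab)\1+ would match repeated copies of "ab". With the textual replacement, the quantifier only applies to the last character of whatever was pasted in.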

On to the bugs. First thing: only \1 works as a hacky backreference. \2 and so on no longer refer to the text matched by the push regex, whereas tmLanguage syntaxes still do the hacky replacement on \2 and up. I really hope this is a bug and not intended, because it breaks compatibility with a lot of tmLanguage files.

Second bug. Whenever a pop regex fails to compile because a hacky backreference made it invalid, Sublime just crashes. It would be much better if Sublime treated the regex as matching anything and printed an error to the console when it fails to compile.
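
A rough Python sketch of the fallback I have in mind. This is not how Sublime works internally: compile_pop_regex and MATCH_ANYTHING are hypothetical names, and reading "matching anything" as the empty pattern is my own assumption.

```python
import re
import sys

# Assumption: "matching anything" means the empty pattern, which matches at any position.
MATCH_ANYTHING = re.compile("")

def compile_pop_regex(substituted_pattern):
    """Compile a pop pattern after substitution; fall back instead of failing."""
    try:
        return re.compile(substituted_pattern)
    except re.error as exc:
        # Report the problem on the console (stderr here) and keep going.
        print(f"pop regex {substituted_pattern!r} failed to compile: {exc}",
              file=sys.stderr)
        return MATCH_ANYTHING

print(compile_pop_regex("a+").pattern)             # valid pattern is kept as-is
print(compile_pop_regex("(") is MATCH_ANYTHING)    # unbalanced "(" triggers the fallback
```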

Third thing, and this was also a bug with tmLanguage. The replacement string that goes into a hacky backreference is escaped to some extent before it goes into the pop regex, but not fully. This is a bug in itself. A much better idea is to escape all ASCII non-“word” characters. No matter what. Perl gets this right with its quotemeta function. In particular whitespace characters, and # and &, are not escaped and cause problems in certain contexts in a regex. Just escape everything like quotemeta does it.
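
For reference, the escaping rule is simple enough to sketch in a few lines of Python. The quotemeta function below is my own approximation of the rule stated above (escape every ASCII character that is not a letter, digit or underscore), not Perl's or Sublime's code, and Python's re.VERBOSE flag stands in for a regex engine's extended mode.

```python
import re

def quotemeta(text):
    """Backslash-escape every ASCII character that is not a letter, digit or underscore."""
    return "".join(
        "\\" + ch if ch.isascii() and not (ch.isalnum() or ch == "_") else ch
        for ch in text
    )

captured = "a #b"
# In extended (verbose) mode an unescaped space is ignored and "#" starts a comment,
# so the raw text collapses to the pattern "a".
print(bool(re.fullmatch(captured, captured, re.VERBOSE)))             # False
print(bool(re.fullmatch(quotemeta(captured), captured, re.VERBOSE)))  # True
```

Escaping everything unconditionally means the pasted text matches itself regardless of where it lands in the pop regex or which flags are active.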

And the last thing I’ve got is a suggestion. Overloading \1, \2, etc. as hacky backreferences is just bad design. It would be much better if you did it with something like {{1}} instead. Using the new variable syntax much better illustrates that it’s a replacement string, and not a real backreference.

An extension of this would be really useful. At times I’ve really wanted to refer to something captured not by the immediate parent, but by the parent of that, or the parent of that, and so on. Another useful thing would be to refer to a named capture in a hacky backreference. I’ve got some examples of a syntax you could use for all of this below.

{{1}} {{2}} etc. - regular hacky backreference.
{{context:1}} - capture 1 of the push regex in the context named context. If that context is on the stack more than once, the most recent one should be used.
{{-2:1}} - capture 1 of the push regex “2 contexts ago”
{{-1:1}} - one context ago, so same as {{1}}
{{context:name}} - hacky backreference to the named capture called name in a context
{{-1:name}} - named hacky backreference to the last context. We need the -1:, because without it it’d look like a normal variable.

If the named or numbered context is not in the stack, then the replacement string should be an empty string. This should not be an error, but instead well-defined behaviour. This allows for a lot more flexibility, where a pop regex doesn’t know what its parent is, but can figure it out by having empty or non-empty hacky backreferences to different parents.
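
To show that these rules are well-defined, here is a small Python sketch of how the placeholders could be resolved. Everything in it is hypothetical: Frame, find_frame, lookup and expand are names I made up, the context stack is modelled as a plain list, and context names are assumed to consist of word characters only.

```python
import re
from typing import List, NamedTuple, Optional

class Frame(NamedTuple):
    """One entry on the (hypothetical) context stack."""
    context: str          # name of the pushed context
    push_match: re.Match  # match of the regex that pushed it

# Covers {{1}}, {{context:1}}, {{-2:1}}, {{context:name}} and {{-1:name}}.
PLACEHOLDER = re.compile(r"\{\{(?:(-\d+|\w+):)?(\w+)\}\}")

def find_frame(stack: List[Frame], selector: Optional[str]) -> Optional[Frame]:
    if selector is None:              # {{1}}: the immediate parent
        return stack[-1] if stack else None
    if selector.startswith("-"):      # {{-2:1}}: counted from the top of the stack
        depth = int(selector[1:])
        return stack[-depth] if 1 <= depth <= len(stack) else None
    for frame in reversed(stack):     # {{context:1}}: latest frame with that name
        if frame.context == selector:
            return frame
    return None

def lookup(frame: Optional[Frame], key: str) -> str:
    if frame is None:
        return ""                     # context not on the stack: empty string
    try:
        value = frame.push_match.group(int(key) if key.isdigit() else key)
    except IndexError:
        return ""                     # unknown capture number or name
    return value or ""                # a group that did not participate is empty too

def expand(pattern: str, stack: List[Frame]) -> str:
    def substitute(ref):
        selector, key = ref.group(1), ref.group(2)
        if selector is None and not key.isdigit():
            return ref.group(0)       # plain {{variable}}: not a backreference
        return re.escape(lookup(find_frame(stack, selector), key))
    return PLACEHOLDER.sub(substitute, pattern)

outer = re.match(r"(?P<quote>['\"])", "'")
inner = re.match(r"(\()", "(")
stack = [Frame("string", outer), Frame("parens", inner)]
print(expand("{{1}} {{string:quote}} [{{-3:1}}] {{ident}}", stack))
```

The missing {{-3:1}} expands to nothing instead of raising an error, and {{ident}} is left alone so normal variables keep working.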

Thanks for reading.

#2

Any update on this?

#3

Backrefs in pop actions other than \1 not working in some circumstances will be fixed in the next build. The syntax won’t be changing.

#4

I would really like to see the ability to pop multiple contexts at once, for example by using: pop: 2, or to be able to pop a context without matching anything first. (I have tried a blank match but it doesn’t seem to work reliably in some circumstances, I think when doing it for the top 2 contexts at the same match position for example.)

This would be useful for grammars where, for example, an operator must follow a variable, but where sometimes a closing bracket is allowed instead because the variable is inside an opening bracket. In my experience it currently seems too complicated to handle that and still match invalid closing brackets.

I can try to get some examples to show if it would help - maybe I just need to rethink how I do my contexts and includes, matches, sets, pushes and pops etc. :)

#5

@jps: there is another pop-related weirdness. The syntax documentation says this about with_prototype:

The context stack may be in the middle of a JavaScript string, for example, but when the </script> is encountered, both the JavaScript string and main contexts will get popped off.

According to that, pop should remove every context on the stack above the one that pushed the prototype. But right now only the top context is removed.

#6

This is only true if you use a look-ahead pop pattern, which will match on all pushed contexts until the original with_prototype push is reached. If you use a regular pattern, the characters are consumed and only one context is popped.
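
A toy Python model of that difference, in case it helps. It is not Sublime's implementation: pop_with_prototype is a hypothetical function, the stack is a plain list of context names, and prototype_depth marks how many contexts at the bottom of the stack do not carry the with_prototype pattern.

```python
import re

def pop_with_prototype(stack, prototype_depth, pattern, text, pos):
    """Pop contexts that carry the prototype pattern while it matches at pos."""
    stack = list(stack)  # work on a copy
    regex = re.compile(pattern)
    while len(stack) > prototype_depth:
        m = regex.match(text, pos)
        if m is None:
            break
        stack.pop()      # each successful match pops exactly one context
        pos = m.end()    # a look-ahead leaves pos unchanged, so it can match again
    return stack, pos

stack = ["html", "js-main", "js-string"]
text = "'abc</script>"

# Look-ahead: zero-width, so it keeps matching until only "html" is left.
print(pop_with_prototype(stack, 1, r"(?=</script>)", text, 4))

# Regular pattern: the characters are consumed, so only one context is popped.
print(pop_with_prototype(stack, 1, r"</script>", text, 4))
```

In the look-ahead case the position stays in front of the closing tag, so the context that did the original push can still consume it.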

#7

So that’s how it was supposed to be used… In that case, here is a proposal from my side: add pop_all. A common case in a syntax is that you match the beginning of some section (say, an opening bracket) and then push a context that processes the section, including its terminator (the closing bracket). With the current approach the terminator won’t be consumed. That’s not a big deal unless the section starts and ends with the same character. A regex literal in JavaScript is a good example.

#8

I see that popping a specified number of contexts has already been requested: https://github.com/SublimeTextIssues/Core/issues/1042
