I messed around with backreferences in the “end” pattern of tmLanguage syntaxes a lot, and I’m really happy this feature is still in sublime-syntax. That being said, I really hope some fixes can be made to them. If you don’t know what I’m talking about, the basic example of these special backreferences is this:
contexts:
  main:
    - match: (.)
      push: repeat_character
  repeat_character:
    - match: \1
      pop: true
This is not a normal backreference. It actually has some pretty weird properties, because it’s implemented by recompiling the pop regex each time and doing a string replacement on instances of \1.
This is a good thing, though. I actually enjoy the cool things you can do with these hacky backreferences, like matching nothing when the capture is an empty string, and matching some other characters when it’s non-empty.
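To make the mechanism concrete, here is a rough Python sketch of the "substitute, then recompile" idea described above. It is only a model of the behaviour, not Sublime's actual code: `build_pop_regex` is a hypothetical name, and Python's `re.escape` stands in for whatever escaping the real engine applies.

```python
import re

def build_pop_regex(pop_source, push_match):
    """Splice captures from the push match into the pop pattern's source text,
    then compile the result. This models textual substitution, not a real
    regex backreference."""
    def splice(ref):
        index = int(ref.group(1))
        if index > push_match.re.groups:
            return ""  # no such capture group: substitute nothing
        # An unmatched optional group yields None; treat it as empty.
        # re.escape is a stand-in for the engine's own (partial) escaping.
        return re.escape(push_match.group(index) or "")
    return re.compile(re.sub(r"\\([1-9])", splice, pop_source))

# Push on optional hashes followed by a quote; pop on a quote plus the same hashes.
push = re.compile(r'(#*)"')
with_hashes = build_pop_regex(r'"\1', push.match('##"'))
no_hashes = build_pop_regex(r'"\1', push.match('"'))

# Because the capture is pasted in as text, a quantifier binds to its last
# character only: \1+ with capture "ab" becomes ab+, not (?:ab)+.
pasted = build_pop_regex(r"\1+", re.match(r"(..)", "ab"))
```

In this model, `with_hashes` only pops on a quote followed by two hashes, while `no_hashes` pops on a bare quote, which is the "empty versus non-empty" trick. The `pasted` pattern matches `abbb` but not `abab`, which a real backreference would never do.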
Onto the bugs. First thing: only \1 works as a hacky backreference. \2 and so on no longer refer to the stuff matched in the push regex. tmLanguage syntaxes still do the hacky replacement on \2 and so on. I really hope this is a bug and not intended, because it breaks compatibility with a lot of tmLanguage files.
Second bug. Whenever a pop regex fails to compile because a hacky backreference made it invalid, Sublime just crashes. It would be really good if Sublime instead treated the regex as matching anything and logged an error to the console when it fails to compile.
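The fallback suggested here could look roughly like the following Python sketch. The name `compile_pop_regex` is hypothetical, and reading "matching anything" as "the empty pattern, which matches at every position" is an assumption about what the fallback should be.

```python
import re
import sys

# Assumption: "matches anything" is modelled as the empty pattern,
# which succeeds at every position, so the context pops immediately.
MATCH_ANYTHING = re.compile("")

def compile_pop_regex(source):
    """Compile a substituted pop pattern; on failure, report it and fall back
    to a pattern that always matches instead of bringing the editor down."""
    try:
        return re.compile(source)
    except re.error as exc:
        print(f"pop regex {source!r} failed to compile: {exc}", file=sys.stderr)
        return MATCH_ANYTHING

# An unbalanced parenthesis stands in for a pattern broken by a substitution.
broken = compile_pop_regex("(")
fine = compile_pop_regex("a+")
```

The point of the design is that a bad substitution degrades into a visible console message and a context that pops straight away, rather than a crash.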
Third thing, and this was also a bug with tmLanguage. The replacement string that goes into a hacky backreference is escaped to some extent before it goes into the pop regex, but not fully. This is a bug in itself. A much better idea is to escape all ASCII non-“word” characters, no matter what. Perl gets this right with its quotemeta function. In particular, whitespace characters, # and & are not escaped, and they cause problems in certain contexts in a regex. Just escape everything the way quotemeta does.
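As an illustration of the rule "escape every ASCII non-word character", here is a small Python sketch. The function is a hand-written approximation in the spirit of Perl's quotemeta, not a port of it, and Python's verbose mode is used only as an analogue of a regex context where whitespace and # are significant.

```python
import re

def quotemeta(text):
    """Backslash-escape every ASCII character outside [A-Za-z0-9_].
    Non-ASCII characters are passed through unchanged."""
    return "".join(
        "\\" + ch if ch.isascii() and not (ch.isalnum() or ch == "_") else ch
        for ch in text
    )

captured = "a #b"

# Pasted in unescaped under verbose mode, the space is ignored and
# everything from '#' onwards is read as a comment.
unescaped = re.compile(captured, re.VERBOSE)

# Fully escaped, the same text is matched literally.
escaped = re.compile(quotemeta(captured), re.VERBOSE)
```

Here `unescaped` ends up matching just `a`, while `escaped` matches the literal text `a #b`, which is the kind of problem partial escaping leaves open.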
And the last thing I’ve got is a suggestion. Overloading \1, \2, etc. as hacky backreferences is just bad design. It would be much better if you did it with something like {{1}} instead. Using the new variable syntax illustrates much better that it’s a replacement string and not a real backreference.
An extension of this would be really useful. At times I’ve really wanted to refer to something captured not by the immediate parent, but by the parent of that, or the parent of that, and so on. Another useful thing would be to refer to a named capture in a hacky backreference. I’ve got some examples of a syntax you could use for all of this below.
{{1}}, {{2}}, etc. - regular hacky backreference.
{{context:1}} - capture 1 of the push regex in the context named context. If there are multiple of this context on the stack, the latest one should be used.
{{-2:1}} - capture 1 of the push regex “2 contexts ago”.
{{-1:1}} - one context ago, so the same as {{1}}.
{{context:name}} - hacky backreference to the named capture called name in the context named context.
{{-1:name}} - named hacky backreference to the last context. We need the -1: prefix, because without it, it’d look like a normal variable.
If the named or numbered context is not in the stack, then the replacement string should be an empty string. This should not be an error, but instead well-defined behaviour. This allows for a lot more flexibility, where a pop regex doesn’t know what its parent is, but can figure it out by having empty or non-empty hacky backreferences to different parents.
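To show that the proposed rules are easy to pin down, here is a Python sketch of a resolver for them. Everything in it is hypothetical: the `Frame` class, the `resolve` function, the restriction of context names to word characters, and the use of `re.escape` for escaping are all assumptions made for illustration, not an existing Sublime API.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Frame:
    """One pushed context plus what its push regex captured."""
    name: str
    numbered: list = field(default_factory=list)  # numbered[0] is capture 1
    named: dict = field(default_factory=dict)

# {{key}}, {{selector:key}}; selector is -N or a context name.
PLACEHOLDER = re.compile(
    r"\{\{(?:(-[0-9]+|[A-Za-z_]\w*):)?([0-9]+|[A-Za-z_]\w*)\}\}"
)

def find_frame(selector, stack):
    if selector.startswith("-"):
        depth = int(selector[1:])  # -1 is the immediate parent (top of stack)
        return stack[-depth] if 1 <= depth <= len(stack) else None
    # Named context: the latest one on the stack wins.
    return next((f for f in reversed(stack) if f.name == selector), None)

def resolve(pattern, stack):
    """Replace hacky-backreference placeholders in a pop pattern's source."""
    def lookup(m):
        selector, key = m.group(1), m.group(2)
        if selector is None:
            if not key.isdigit():
                return m.group(0)  # plain {{name}}: an ordinary variable
            selector = "-1"        # {{1}} is shorthand for {{-1:1}}
        frame = find_frame(selector, stack)
        if frame is None:
            return ""              # missing context: empty string, not an error
        if key.isdigit():
            i = int(key)
            value = frame.numbered[i - 1] if 1 <= i <= len(frame.numbered) else ""
        else:
            value = frame.named.get(key, "")
        return re.escape(value)    # stand-in for full quotemeta-style escaping
    return PLACEHOLDER.sub(lookup, pattern)

# Bottom of the stack first; the last entry is the immediate parent.
stack = [Frame("heredoc", ["EOF"], {"tag": "EOF"}), Frame("string", ["q"])]
```

With this stack, `resolve("^{{heredoc:tag}}$", stack)` produces a pattern anchored on the heredoc tag, `{{1}}` and `{{-1:1}}` both refer to the capture of the `string` context, and a reference to a context that is not on the stack simply disappears from the pattern.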
Thanks for reading.