Sublime Forum

Named groups/captures in syntax definitions

#1

I’ve come to realise that named groups (e.g. (?'name'texttomatch) or (?<name>texttomatch)) cannot be used in syntax definitions, so we cannot later refer to them in captures as name: some_scope instead of 1: some_scope.

Supporting this would improve the readability of long regexes and help avoid group-counting errors during syntax development.


#2

ST’s sregex engine currently doesn’t support named captures (see the Syntax Tests - Regex Compatibility build system variant), and I doubt this functionality would get added for Oniguruma, as ST is moving away from it.

To make life easier with numbered capture groups, consider installing https://packagecontrol.io/packages/PackageDev, which will highlight the capture group:


#3

Sort of a side question, just looking at that example. I’ve always been fuzzy about how groups are numbered when capture groups are nested. I’ve always found it useful, though, to count the opening parens, and that seems to match. So in this case, it kind of looks like the final punctuation.definition.link.end.markdown should be capture group 5. I count:

  1. (<) should be 1
  2. ( ..... ) is the highlighted group and is 2.
  3. (?:mailto:) I think is 3.
  4. (\.[-a-z0-9]+) I think this is the missing one?
  5. (>) And then if there’s a missing one, should this one be capture 5?

#4

The third group you are counting, like any group that starts with ?:, is non-capturing.
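
To make the counting rule concrete, here is a small Python sketch. The pattern is a hypothetical one, loosely modelled on an autolink rule and not the actual regex from the example above, and it runs on Python's re module rather than ST's engines. The numbering rule is the same, though: count opening parens from the left and skip any that start with ?:.

```python
import re

# Hypothetical pattern for illustration only (not the real Markdown rule).
# Opening parens, left to right:
#   (<)               -> group 1
#   ( ... )           -> group 2 (the outer group)
#   (?:mailto:)       -> non-capturing, gets no number
#   (\.[-a-z0-9]+)    -> group 3
#   (>)               -> group 4
pattern = re.compile(
    r"(<)((?:mailto:)?[-a-z0-9.]+@[-a-z0-9]+(\.[-a-z0-9]+)*)(>)"
)

# Number of capturing groups the engine actually assigned.
print(pattern.groups)

m = pattern.fullmatch("<mailto:jane@example.com>")
for i in range(1, pattern.groups + 1):
    print(i, repr(m.group(i)))
```

With this pattern the closing (>) ends up as group 4, not 5, because the (?:mailto:) group is skipped when numbering.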


#5

Aha. Yeah that would probably do it. Thanks.


#6

Thanks for the insight and the suggestion!

Out of curiosity, what engine are they moving towards? I hope the other engine offers more features, like named groups and possessive quantifiers that are not slower than normal greedy quantifiers…


#7

Oniguruma is the old engine. It’s a general-purpose regexp engine that supports features like backreferences that aren’t actually “regular” in the technical sense. It uses (I presume) a backtracking implementation that can run into performance issues in corner cases. Possessive quantifiers make it easier to avoid these cases, and they fit naturally into a backtracking engine.
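
As a small illustration of why backreferences are not “regular”, here is a sketch using Python's re module (a backtracking engine, standing in for Oniguruma here as an assumption for the sake of the example). The pattern requires the same run of a characters on both sides of a b, which needs unbounded memory of what was matched earlier and so cannot be expressed by a finite state machine.

```python
import re

# (a+) captures a run of "a"s; \1 must then match that exact same text again.
same_run = re.compile(r"(a+)b\1")

print(bool(same_run.fullmatch("aabaa")))   # equal runs on both sides -> True
print(bool(same_run.fullmatch("aabaaa")))  # unequal runs -> False
```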

The new engine, sometimes known as “sregex” (but not the same as the open-source engine of that name), was written specifically for Sublime Text. It is optimized for use with Sublime’s syntax highlighting system. It is a true regular expression engine, meaning that it does not implement backreferences or other non-regular features. I presume that it is implemented using state machines, which work in guaranteed linear time and can be combined to match multiple regexps simultaneously. Possessive quantifiers would not make such an engine run any faster.

Documentation for the engine is sparse; my understanding of it is based on various statements from the team and on my own knowledge of regexp engine implementation. The big features that will almost certainly never be implemented are backreferences and lookbehind. These features are fundamentally incompatible with the design of the new engine. It’s possible that named captures could be implemented someday, because that’s not really part of the matching engine itself but could be handled early when processing a regexp. Possessive quantifiers could be implemented (I think), but because they mostly exist to work around performance problems in backtracking parsers there is no reason to use them with the new engine in the first place.
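
To show what “handled early when processing a regexp” could look like, here is a rough Python sketch. It is purely an assumption about one possible approach, not how Sublime Text works: the helper strip_group_names is hypothetical, and it ignores corner cases such as a leading ] inside a character class or Oniguruma's nested classes. It rewrites named groups into plain numbered groups and returns a name-to-number table, so the matcher itself never sees a name.

```python
import re

# Matches the start of a named group: (?<name> or (?'name'
# Lookbehinds such as (?<= and (?<! do not match, since a name must
# start with a letter or underscore.
_NAMED = re.compile(r"\(\?(?:<([A-Za-z_]\w*)>|'([A-Za-z_]\w*)')")


def strip_group_names(pattern):
    """Return (pattern with names removed, {name: group number})."""
    out = []
    names = {}
    count = 0          # capturing groups seen so far
    in_class = False   # True while inside a [...] character class
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            # Escaped character: copy both characters verbatim.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if in_class:
            # Parens inside a class are literals; only "]" ends the class.
            in_class = ch != "]"
        elif ch == "[":
            in_class = True
        elif ch == "(":
            named = _NAMED.match(pattern, i)
            if named:
                count += 1
                names[named.group(1) or named.group(2)] = count
                out.append("(")      # keep the group, drop the name
                i = named.end()
                continue
            if not pattern.startswith("(?", i):
                count += 1           # plain capturing group
        out.append(ch)
        i += 1
    return "".join(out), names


rewritten, names = strip_group_names(
    r"(?<open><)(?:mailto:)?(?<addr>[^>(]+)(?'close'>)"
)
print(rewritten)
print(names)
```

A captures block written with names could then be translated to numbers through the returned table before the syntax is compiled.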

There’s a built-in build system: “Syntax Tests - Regex Compatibility”. This will tell you if any of the regexps in your syntax are incompatible with the new engine. Such regexps must be processed using the older, slower engine. Because you’re using backreferences to account for indentation, it is impossible for the syntax you’re working on to be 100% compatible with the new engine. I believe (but have not verified) that any context with only compatible regexps should still be fast. Fundamentally, Sublime’s parser is only designed to recognize deterministic context-free languages, which can’t handle indentation. In some cases, you can use backreferences to handle such things. This is a trade-off between functionality and performance.


#8

Thank you very much for the thorough info. It seems I’m stuck with the slower engine, since I’m using backreferences extensively, so I might as well use possessive quantifiers freely. I will keep it in mind for when I work on murepy regular languages. Thank you very much again.
