Sublime Forum

What's Sublime's new regex engine?

#1

I’ve heard that Sublime Text has ditched the Oniguruma regex engine in favor of a new one that is more performant.
What’s this new regex engine? Is it open source?

1 Like

#2

https://www.sublimetext.com/blog/articles/sublime-text-3-build-3103

3103 also features a custom regex engine that significantly speeds up file loading and indexing.

It’s proprietary to ST3, and I’m not sure it has a name.

0 Likes

#3

Sublime Text build 3085 and newer contain a custom, non-backtracking regex engine for syntax highlighting only. It is heavily optimized for the type of matching that happens when lexing source files.

It is not open source, and does not contain some features present in Oniguruma, since it is a different style engine. Whenever possible, the Sublime Text regex engine will be used for a regex pattern in a .sublime-syntax or .tmLanguage file. If the regular expression uses constructs that are not supported by the new engine, Oniguruma will be used instead. We do not currently have any plans on removing the Oniguruma engine, so all existing .tmLanguage files will continue to be supported.

In general, the features not supported by the new engine can cause performance issues. These include:

  • Lookbehinds
  • Backreferences

However, there are some features which are not supported that don’t really fit with the style of engine:

  • Atomic subgroups
  • Possessive quantifiers

In general, Oniguruma supports a lot of syntax that other engines do not. Beyond the features mentioned above, other edge-case syntax may also be unsupported, such as the negated POSIX character class [[:^ascii:]], which only PCRE and Oniguruma support.
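
As a rough illustration of that edge case, a negated POSIX class can usually be rewritten as an ordinary negated range. The rule below is a hypothetical sketch: the scope name is made up, and whether the rewritten form stays on the new engine is an assumption worth confirming with the Regex Compatibility build variant.

```yaml
# Hypothetical rule, for illustration only.
# BEFORE: negated POSIX class (the edge-case syntax mentioned above)
- match: '[[:^ascii:]]+'
  scope: meta.non-ascii.example

# AFTER: a plain negated range covering the same characters
- match: '[^\x00-\x7F]+'
  scope: meta.non-ascii.example
```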

In build 3109, we added the Regex Compatibility variant to the Syntax Tests build system. This will provide feedback about any regex patterns not supported by the new engine.

As a general rule, syntaxes that use only features supported by the new regex engine tend to be quite a bit faster. For instance, the JavaScript syntax in 3109 is about 4x as fast as the one in 3103. HTML has been improved by probably 20%, and there are more examples I can’t think of off of the top of my head. Since build 3098 we’ve had a build variant that will profile the current view using the syntax it is currently highlighted with.

However, depending on your regular expression constructs, it is still possible to make the new regex engine slow. For instance, doing extensive lookaheads at every character in a repeating group can definitely be slow. This is why measuring is useful, especially with large files.
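
A sketch of the kind of pattern this describes, using a block comment as a hypothetical example (the scope names are illustrative and not taken from any shipped syntax). The first rule evaluates a lookahead before every character it consumes; the second hands the body to a context that only looks for the terminator. Which one is faster for a given file is something to measure rather than assume.

```yaml
# Hypothetical fragment; scope names are made up for illustration.
# POTENTIALLY SLOW: a negative lookahead runs before every character consumed
- match: '/\*(?:(?!\*/).)*\*/'
  scope: comment.block.example

# ALTERNATIVE: push a context and only search for the closing delimiter
- match: '/\*'
  push:
    - meta_scope: comment.block.example
    - match: '\*/'
      pop: true
```

The two rules are not strictly equivalent: patterns are matched one line at a time, so the single-regex form only handles comments that end on the line where they start, while the context form also spans lines.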

Since the new engine is generally much faster, as we are improving the quality of the default package syntaxes, we are also ensuring that they do not use features unsupported in the new engine. In build 3109 when we added the compatibility build system variant, we also added support for a number of regex features that were previously unsupported.

Due to the nature of the engine, it is unlikely that we will implement lookbehinds or backreferences. The same is likely true for atomic subgroups and possessive quantifiers. That said, from spending a whole bunch of time working on syntaxes recently, I can attest that the explicit context control in .sublime-syntax files generally makes it possible to accomplish the same sort of lexing as these constructs would.
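
For example, a backreference that pairs a closing quote with its opening quote can often be replaced by one context per quote style. This is a minimal sketch with hypothetical scope names, and it ignores escape sequences to keep it short.

```yaml
# Hypothetical fragment; scope names are made up for illustration.
# BEFORE: \1 forces the closing quote to equal the opening one
- match: (["']).*?\1
  scope: string.quoted.example

# AFTER: one context per quote style, no backreference needed
- match: '"'
  push:
    - meta_scope: string.quoted.double.example
    - match: '"'
      pop: true
- match: "'"
  push:
    - meta_scope: string.quoted.single.example
    - match: "'"
      pop: true
```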

Sometimes this involves explicitly tracking the “state” of the language being lexed. For instance, to deal with regex literals and the division operator in JavaScript, we switch contexts after various operators and parentheses to ensure we don’t match the beginning of a regex literal as a division operator. Recently I’ve been working on C/C++, and there is work being done on tracking state to ensure we are properly detecting function definitions, even when spanning multiple lines.
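
The idea can be sketched with a toy syntax. This is not the shipped JavaScript definition; every name, scope, and context below is hypothetical and the grammar is deliberately tiny. A slash seen while an operand is expected starts a regex literal, and a slash seen right after an operand is division.

```yaml
%YAML 1.2
---
# Hypothetical toy syntax, for illustration only.
name: Toy Expression (example)
scope: source.toy-example
contexts:
  main:
    # An operand is expected here, so a slash starts a regex literal.
    - match: '/'
      scope: punctuation.definition.string.begin.toy-example
      push: regex-body
    # Identifiers, numbers and closing parentheses all end an operand.
    - match: '[A-Za-z_]\w*|\d+'
      scope: variable.other.toy-example
      push: after-operand
    - match: '\)'
      push: after-operand

  after-operand:
    # An operand was just consumed, so a slash here is division.
    - match: '/'
      scope: keyword.operator.arithmetic.toy-example
      pop: true
    - match: '[-+*=(,;]'
      scope: keyword.operator.toy-example
      pop: true
    # Anything else: return to main without consuming it.
    - match: '(?=\S)'
      pop: true

  regex-body:
    - meta_scope: string.regexp.toy-example
    - match: '\\.'
      scope: constant.character.escape.toy-example
    # A finished regex literal is itself an operand.
    - match: '/'
      scope: punctuation.definition.string.end.toy-example
      set: after-operand
    # Bail out at end of line so a stray slash cannot swallow the file.
    - match: '$'
      pop: true
```

With this sketch, `a / b` would scope the slash as an operator, while `x = /ab+c/` would scope it as the start of a regex, without any lookbehind.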

11 Likes

#4

@wbond Thanks for shedding some light on this issue. This was exactly what I needed to know.

0 Likes

#5

@wbond Can you write a short article with tips and tricks on solving these regex compatibility issues for plugin owners? I’ve seen that you are already fixing most of these issues in the default packages. It would be nice if you could share the knowledge you gained while going through this process and fixing all this complicated regex code.
I ran the test on one of my old syntaxes that I converted, and here are the top issues:

  1. \G is unsupported
  2. Look behind is not supported
  3. Negative look behind is not supported
  4. Named captures are not supported
  5. Back references are unsupported

It would be great if you give us some hints on how to tackle each of these problems.

2 Likes

#6

Thank you for clarifying this, I was pulling my hair out trying to get a replace regex to work! It would ‘find’ but not ‘replace’.

Looks like it didn’t because it used a look behind, but the look behind wasn’t strictly necessary so a little tweaking and bang! Saved me a ridiculous amount of time editing some legacy code!

I don’t mind it not working with look behind but maybe an error or warning message when attempting to replace would be worthwhile.

Anyways, it still saved me oodles of time! Thanks!

0 Likes

#7

This thread (and the regex engine) are strictly about syntax highlighting. ST uses boost pcre for find and replace.

You probably just had an error in your expression.

0 Likes

#8

Hmm, OK… but it would find and highlight a character, yet was entirely unable to replace it.

0 Likes

#9

Find and replace with lookbehinds has been broken for a while: https://github.com/SublimeTextIssues/Core/issues/1150 / https://github.com/SublimeTextIssues/Core/issues/308

1 Like

#10

Thank you very much for this! This information is gold. I’m hard at work right now rewriting my CSS3 package in sublime-syntax. I’ve already made several changes based on this info.

wbond wrote: “I can attest that the explicit context control in .sublime-syntax files generally makes it possible to accomplish the same sort of lexing as these constructs would.”

Unfortunately, I’ve been using lookbehind extensively, and removing it is breaking the world. Do you have any suggestions about how to fake lookbehind, or were you referring to the other regex features like backreferences?

UPDATE: I discovered an easy way to fake positive lookbehind.

# BEFORE
- match: (?<=hello)world

# AFTER
- match: hello
  push:
    - match: world
      pop: true

2 Likes

#11

I wouldn’t call that faking a lookbehind, but rather contextual parsing, which is what lookbehinds are also trying to accomplish.

You can see examples of the power of these contexts by checking out the after-* contexts in JavaScript and Python. It is possible to determine what an ambiguous token represents by keeping track of what was parsed before it.
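
A minimal sketch of that pattern, applied to the hello/world case from earlier in the thread. The context and scope names are hypothetical. It adds a fallback so the context is left immediately when "world" does not directly follow "hello", which is closer to what (?<=hello)world would match. It assumes that when two rules match at the same position, the earlier one wins.

```yaml
# Hypothetical fragment; context and scope names are made up for illustration.
contexts:
  main:
    - match: hello
      push: after-hello

  after-hello:
    # "world" directly after "hello": the case the lookbehind would accept
    - match: world
      scope: keyword.other.example
      pop: true
    # Anything else: leave the context without consuming input
    - match: ''
      pop: true
```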

3 Likes