Sublime Forum

RegEx Fails to Match Unicode Escaped Range

#1
ST 3165

Yesterday I was writing a RegEx to capture identifiers in a syntax definition I’m working on (using the .sublime-syntax format). I couldn’t manage to get Unicode escapes to work.

The intended RegEx was:

[a-zA-Z\u00e0-\u00fe&&[^\u00f7]]

… where the Unicode code points in the range \u00e0-\u00fe (except for \u00f7) should be valid characters. I couldn’t get it to match the desired characters, even after trying several variations of the RegEx.

But this version now works fine:

[a-zA-Zà-þ&&[^÷]]

… where I use the actual Unicode characters in the range, instead of the Unicode escape.
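As a rough sanity check that the two spellings denote the same set of characters, here is a small Python sketch. Python's `re` module is only a stand-in here and is not the engine ST uses; it also has no `&&` intersection, so a negative lookahead excludes U+00F7 instead. The point is only that the escaped and literal forms cover identical code points, so any difference in ST has to come from how the escape is parsed.

```python
import re

# Stand-in engine: Python's `re`, NOT Sublime Text's regex engine.
# `re` lacks `&&` intersection, so a negative lookahead excludes U+00F7.
escaped = re.compile(r"(?!\u00f7)[a-zA-Z\u00e0-\u00fe]")
literal = re.compile("(?!÷)[a-zA-Zà-þ]")

# Probe every code point from U+00C0 to U+00FF.
candidates = [chr(cp) for cp in range(0x00C0, 0x0100)]
matched_escaped = [c for c in candidates if escaped.fullmatch(c)]
matched_literal = [c for c in candidates if literal.fullmatch(c)]

# Both spellings accept exactly the same characters.
assert matched_escaped == matched_literal
assert "÷" not in matched_escaped
print("".join(matched_escaped))
```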

Why is this? Oniguruma documentation states that the \uHHHH syntax is supported:

\uHHHH       wide hexadecimal char  (character code point value)

And, if my understanding is correct, ST has its own RegEx engine (faster) but also supports Oniguruma as a fallback engine for natively-unsupported RE features.

Note that both of my RegExes above passed the “Syntax RegEx Compatibility Test”; the former simply doesn’t match the desired characters.

NOTE — the source document I’ve used to test them is in ISO-8859-1 encoding; could this be the reason?

My guess is that encoding only applies to files on disk, and that ST internally always represents strings as Unicode. But I admit I lack knowledge of how ST’s RegEx engine relates to Unicode.
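The general idea behind that guess can be sketched in Python: once bytes are decoded, the on-disk encoding no longer matters and only the code points remain. This shows the principle only; whether ST handles buffers exactly this way is an assumption, not something confirmed here.

```python
# The byte values that ISO-8859-1 uses for à, ÷ and þ.
raw = bytes([0xE0, 0xF7, 0xFE])

# Decoding turns the on-disk bytes into Unicode code points.
text = raw.decode("iso-8859-1")
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)  # ['U+00E0', 'U+00F7', 'U+00FE']

# Saved as UTF-8, the same characters use different bytes,
# yet they decode back to the identical string.
utf8_bytes = text.encode("utf-8")
assert utf8_bytes != raw
assert utf8_bytes.decode("utf-8") == text
```

If the engine works on decoded text, a range such as à-þ would see the same code points regardless of the file's encoding.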

0 Likes

#2

I’m not sure if it applies to ranges (I’ve never used it that way in a syntax before) but I’ve used the following to match single unicode characters in syntaxes, so you might want to try that variant and see how it works for you.

0 Likes

#3

Thanks @OdatNurd, I confirm that I’ve tried this in my syntax file and it works:

[a-zA-Z\x{00E0}-\x{00FE}&&[^\x{00F7}]]

It passed the RegEx compatibility test.

So, unlike the Oniguruma syntax (i.e. \uHHHH), in ST we should use \x{HHHH}.

Good to know that both the actual Unicode character and its hex code work in ranges. When a character is visibly representable the former might be more intuitive, but in all other cases the escaped hex code point syntax is essential.
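For anyone with existing patterns written in the \uHHHH style, a conversion along these lines might save some typing. This is a minimal Python sketch; `u_to_x_escapes` is a hypothetical helper name, and it deliberately ignores edge cases such as an escaped backslash directly before the `u`.

```python
import re

def u_to_x_escapes(pattern: str) -> str:
    # Hypothetical helper: rewrite each \uHHHH escape as \x{HHHH}.
    # Does not handle an escaped backslash before the `u` (e.g. \\u0041).
    return re.sub(
        r"\\u([0-9A-Fa-f]{4})",
        lambda m: "\\x{" + m.group(1).upper() + "}",
        pattern,
    )

original = r"[a-zA-Z\u00e0-\u00fe&&[^\u00f7]]"
converted = u_to_x_escapes(original)
print(converted)
```

The converted pattern should still be run through the syntax compatibility test, since the helper only rewrites the escape spelling and knows nothing about the rest of the RegEx.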

0 Likes

#4

That understanding isn’t correct. Maybe it was true once, but when I’ve drilled down into the more obscure and esoteric features of the RE syntax, they don’t correspond to Oniguruma at all.

0 Likes

#5

I see. In fact I’ve read that in a post dating back to 2016, referring to ST 3109:

One of the disadvantages of forums as a source of information is that it’s difficult to check if the info you come across is still up to date.

0 Likes

#6

We use Oniguruma 5.9 as our fallback. While sregex doesn’t support all features in Oniguruma, our target is compatible syntax. The regexes should result in the same matches and captures for syntax that works in both. You can trigger Oniguruma by having at least one match in a context that uses a feature that the Regex Compatibility build result marks as incompatible.

Issues where the result is different should be considered a bug.

0 Likes

#7

Here is the link to the docs for Oniguruma 5.9: https://github.com/kkos/oniguruma/blob/5.9.6/doc/RE#L26. You’ll notice there is no \u syntax. My hunch is that was added in Oniguruma 6 at some point.

1 Like

#8

Thanks for the link, that clarifies a lot, but it still leaves some ambiguities, such as what build options are specified.

(?s) works so I’m going to guess ONIG_SYNTAX_PERL but ONIG_OPTION_CAPTURE_GROUP I’m not sure about.

0 Likes