ST 3165
Yesterday I was writing a RegEx to capture identifiers in a syntax definition I’m working on (using the .sublime-syntax
format), and I couldn’t get Unicode escapes to work.
The intended RegEx was:
[a-zA-Z\u00e0-\u00fe&&[^\u00f7]]
… where the Unicode code points in the range \u00e0-\u00fe (except for \u00f7) should be valid characters. I couldn’t get it to match the desired characters; I tried several variations of the RegEx, but none of them worked.
But this version now works fine:
[a-zA-Zà-þ&&[^÷]]
… where I use the actual Unicode characters in the range, instead of the Unicode escape.
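As a sanity check that the two spellings describe the same set of characters, here is a minimal sketch in Python. It uses Python's `re` module, not ST's engine or Oniguruma, so it says nothing about how ST handles the escapes. Python's `re` has no `&&` class intersection, so the exclusion of \u00f7 is emulated with a negative lookahead; `matched_chars` is a helper name made up for this example.

```python
import re

# Assumption: Python's `re` stands in for illustration only; it is neither
# ST's custom engine nor Oniguruma. `&&` is not supported here, so the
# exclusion of U+00F7 is emulated with a negative lookahead.
ESCAPED = re.compile(r"(?!\u00f7)[a-zA-Z\u00e0-\u00fe]")
LITERAL = re.compile("(?!÷)[a-zA-Zà-þ]")


def matched_chars(pattern):
    """Return the single characters in U+0000..U+00FF that the pattern fully matches."""
    return {chr(cp) for cp in range(0x100) if pattern.fullmatch(chr(cp))}


escaped_set = matched_chars(ESCAPED)
literal_set = matched_chars(LITERAL)

# Both spellings should accept exactly the same characters.
print(escaped_set == literal_set)
print("÷" in escaped_set, "þ" in escaped_set)
```

In an engine that handles \uHHHH inside character classes, the two forms should be interchangeable, which is what makes the behaviour in ST surprising.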
Why is this? Oniguruma documentation states that the \uHHHH syntax is supported:
\uHHHH wide hexadecimal char (character code point value)
And, if my understanding is correct, ST has its own (faster) RegEx engine but falls back to Oniguruma for RegEx features that its own engine doesn’t support.
Note that both RegExes above passed the “Syntax RegEx Compatibility Test”; the first one simply doesn’t match the desired characters.
NOTE: the source document I used to test them is in ISO-8859-1 encoding; could this be the reason?
My guess is that the encoding only applies to files on disk, and that ST always represents strings internally as Unicode. But I admit I lack knowledge of how ST’s RegEx engine relates to Unicode.
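To illustrate why the file encoding should not matter once the text is decoded, here is a small Python sketch. It only shows that ISO-8859-1 bytes map one-to-one onto the same Unicode code points used in the RegEx; it is not a statement about ST's internals, which remain an assumption here.

```python
# Bytes as they would be stored on disk in an ISO-8859-1 file: à, ÷, þ
raw = bytes([0xE0, 0xF7, 0xFE])

# Decoding yields ordinary Unicode text; each ISO-8859-1 byte value
# equals the code point of the character it encodes.
text = raw.decode("iso-8859-1")
code_points = [f"U+{ord(ch):04X}" for ch in text]

print(text, code_points)
```

If ST decodes the buffer in the same way when opening the file, the characters seen by the RegEx engine would be U+00E0, U+00F7 and U+00FE regardless of the on-disk encoding.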