Sublime Forum

Details of Oniguruma Flavor

#1

I’m having great fun with the new sublime-syntax! But I’ve run into a little surprise:

Sublime tells me that the character class \p{ID_Start} is an invalid character property name.

From what I understand, Sublime uses Oniguruma, yet Oniguruma has no trouble understanding all Unicode character property names, even those which don’t have shortcut aliases. In addition to saying so on the Oniguruma site, I can see it myself right here:

rubular.com/r/odBhEO55Ls

Is Sublime using an older implementation? Are there sub-types of Oniguruma, and if so, which one is this and is there documentation on what it actually supports?


#2

The thing you linked to doesn’t have an underscore in “ID Start”. Is that maybe the problem?


#3

Unfortunately, that’s not the problem, but thanks for checking it out. Oniguruma allows Unicode’s formal character property names to be specified with space or underscore. Sublime doesn’t recognize either. I also tried “Id_Start”, “Id Start”, “id_start” and “id start”.

Fortunately it’s possible to replicate this using a collection of other two-letter property aliases and some additional explicit inclusions. The resulting regex is longer, of course, but hardly the worst I’ve dealt with.

On the other hand, it does seem notable to me that it’s not available. Not so much because it suggests we can’t rely on Oniguruma’s documentation to know what will be valid (trial and error is pretty painless, since Sublime fantastically live-reloads the syntax definition any time you save), but mainly because the particular property classes in question (ID Start, ID Continue, XID Start and XID Continue) were created to assist in tasks related to programming language syntax. These groups correspond to the typical set of characters that are legal in identifiers (in languages which allow Unicode identifiers), give or take a few characters you may need to specify in addition. Pretty useful for writing syntax matching patterns!
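
For anyone who wants to try the same workaround, here is a rough sketch of the idea in Python, not the exact regex I ended up with. It assumes ID_Start is approximately the letter categories plus Nl, plus the short Other_ID_Start list, minus the one letter I know of that is also Pattern_Syntax (U+2E2F). The names `approx_id_start` and `onig_char_class` are made up for illustration, and the generated character class is untested in Sublime; the U+2E2F exclusion is only applied in the Python check, not in the generated class.

```python
import unicodedata

# General categories that make up the bulk of ID_Start:
# all letter categories plus "letter numbers" (Nl).
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Other_ID_Start: code points kept in ID_Start for stability
# (assumed list; check the Unicode PropList.txt for your version).
OTHER_ID_START = {0x1885, 0x1886, 0x2118, 0x212E, 0x309B, 0x309C}

# Letters that are also Pattern_Syntax and therefore excluded
# (assumption: U+2E2F VERTICAL TILDE is the only such case).
EXCLUDED = {0x2E2F}


def approx_id_start(ch: str) -> bool:
    """Approximate the ID_Start property for a single character."""
    cp = ord(ch)
    if cp in EXCLUDED:
        return False
    return cp in OTHER_ID_START or unicodedata.category(ch) in ID_START_CATEGORIES


def onig_char_class() -> str:
    """Build an Oniguruma-style character class from the same data."""
    props = "".join(rf"\p{{{c}}}" for c in sorted(ID_START_CATEGORIES))
    extras = "".join(rf"\x{{{cp:04X}}}" for cp in sorted(OTHER_ID_START))
    return f"[{props}{extras}]"


if __name__ == "__main__":
    print(onig_char_class())
    for sample in ["a", "é", "1", "_", "\u2118"]:
        print(repr(sample), approx_id_start(sample))
```

Note that `_` and `$` are not in ID_Start, so a language that allows them in identifiers needs them added to the class by hand.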


#4

Oniguruma syntax is detailed at geocities.jp/kosako3/oniguruma/doc/RE.txt. I’m not aware of “ID_Start”: it’s not mentioned in the above document, nor does the string appear in the Oniguruma source.


#5

Well, it is mentioned:

* \p{property-name}
* \p{^property-name}    (negative)
* \P{property-name}     (negative)

But then they go on to list values. I suppose you could interpret that as an enumeration? However, I would have assumed that list is short only for practical reasons. It shows Oniguruma-specific aliases, the ones corresponding to scripts, and a few other commonly used examples. There are many other character properties, and it makes sense not to list them all in a doc on regex. “Character properties” come from the Unicode standard, so that is always the canonical resource for their names and definitions. In implementations of Oniguruma I’ve used previously, you can refer to any of them using \p{}, not just the subset listed there. I wouldn’t have expected any except the Oniguruma-specific aliases to appear hardcoded in the source: this data would typically come from an outside source (system, library) because it expands with time (though characters never lose a property previously assigned, to prevent breaking changes – when Unicode makes a mistake, they have to invent a whole new property class). I guess we could figure out whether they are hardcoded by looking for Bopomofo or whatever to see if that appears in the source :smile:

It’s possible that originally that was intended as an enumeration and it really did not support any others. All I know is that, at least since I first encountered Oniguruma a few years ago, character properties not in that list have seemed to work fine everywhere else. That’s why I’m wondering which specific implementation/version is in use.


#6

ID_Start is provided by the Ruby version of Oniguruma, but not the standalone version of Oniguruma. My hunch for the reason behind this is that ID_Start is a derived property (vs a simple one), and perhaps the code that was used to derive the Oniguruma character tables didn’t support derived properties. The Ruby version provides its own set of character tables, rather than using the standard Onig ones.

FWIW as of build 3084, Oniguruma 5.9.6 is used (the latest at the time of writing).


#7

Thanks, that’s illuminating. I hadn’t realized Ruby’s Oniguruma isn’t “vanilla.” That explains it.

(Not directly related, but I’ll add: I’m about 1/3 through my first sublime-syntax definition and it’s awesome now that I’m getting the hang of it – it’s possible to disambiguate things that always had to be conflated before, and the accuracy of matches you can achieve even in complex / weirdly nested syntax is awesome. This is my new favorite Sublime feature.)


#8

Glad you like it! I also wasn’t aware of the difference between the Ruby and vanilla versions of Onig; it’s been worthwhile to dig into, though.
