Sublime Forum

Syntax Regexp Question

#1

Well, I think I may be dense here, but I haven’t been able to figure out a way to do this with the RE language. So a valid identifier has the following rules:

  1. Case insensitive
  2. Must begin with an alphabetic character.
  3. Characters allowed are alphanumeric characters plus underscore.
  4. May not end in an underscore.
  5. May not contain two or more successive underscores.

So, I’ve been tinkering with the following pattern in the python console:

[A-Za-z][[A-Za-z0-9]_]*[A-Za-z0-9]

In the syntax file I can use :alpha: and :alnum: because those appear to be defined by Oniguruma REs, but I think the standard Python re module does not support them. Anyhow, that pattern captures the validity of the first character, the validity of the last character, and one of the rules for the interior. However, I have run into a problem: when testing this, the * isn’t greedy like it should be.

>>> import re
>>> pattern = r'[A-Za-z][[A-Za-z0-9]_]*[A-Za-z0-9]'
>>> str = "my_identifier_9"
>>> s = re.search(pattern, str)
>>> s.group()
'my_i'

So problems,

  1. The * doesn’t seem to be greedy like it should be. It should have tried to grab everything from y to the second _. I can’t say I’ve ever seen * not be super greedy.

  2. It doesn’t take into consideration a single character identifier, though I think that can be solved by adding a * after the final group to make sure it’s optional.

  3. I haven’t the foggiest notion how to exclude successive underscores. I could potentially use a negative lookahead (?!__) but I’m not entirely sure how to combine that with the rest of it since that * is supposed to be greedy and I need to check every instance of underscore. That is to say I think

    [a-zA-Z](?!__)[[a-zA-Z0-9]_]*[a-zA-Z]*

would check for a __ after the first character but would permit double underscores after that. The interior group will be slurping up character by character. (I added the * at the end because I think that solves #2).

  4. Would it be clearer to use (?i) at the beginning and just check for a-z throughout? I see (?i) used a lot in the syntax files because clearly you can’t pass re.IGNORECASE as a flag there as you can in direct Python RE. I guess this is more of a common practice question.
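For the console side of that last question, here is a minimal sketch suggesting that the inline (?i) and the re.IGNORECASE flag end up equivalent in Python's re. The names inline and flagged and the sample strings are made up for illustration; in a syntax file only the inline form is available.

```python
import re

# Same pattern, case-insensitivity requested two different ways.
inline = re.compile(r'(?i)[a-z][a-z0-9]*')
flagged = re.compile(r'[a-z][a-z0-9]*', re.IGNORECASE)

for text in ('Counter1', 'COUNTER1', 'counter1'):
    # Both compiled patterns accept every casing of the same name.
    print(text, bool(inline.fullmatch(text)), bool(flagged.fullmatch(text)))

# The inline flag is reflected in the compiled pattern's flags as well.
print(bool(inline.flags & re.IGNORECASE))  # True
```

Since the result is the same, writing (?i) once and sticking to a-z keeps the console pattern identical to what goes in the syntax file.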

I suppose #1 and #3 are the biggest questions. I really can’t figure out why * isn’t greedy and how to avoid multiple _'s.

Though perhaps the inner group could be [[a-zA-Z0-9]_?]* but I’m not entirely sure if that’s valid. Seems vaguely sketchy. I could test it at console, except the searcher is not being properly greedy.

Actually it’s worse than I thought with the final character. I tried the following:

pattern = r'(?i)[a-z]\w*[a-z0-9]?'

And that actually matches all the bad cases. Since the ? at the end means it’s optional and the character matches _ in the \w group, it’ll match ‘another_bad_ident_’ quite happily.
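A quick sketch of that failure at the console, using re.fullmatch so the whole string has to match. The name loose is just a label for this example.

```python
import re

# \w includes the underscore, and the trailing class is optional,
# so nothing forces the last character to be a letter or digit.
loose = re.compile(r'(?i)[a-z]\w*[a-z0-9]?')

for text in ('another_bad_ident_', 'double__underscore'):
    m = loose.fullmatch(text)
    print(repr(text), '->', m.group() if m else None)
# 'another_bad_ident_' -> another_bad_ident_
# 'double__underscore' -> double__underscore
```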

Using \w* seems to be very greedy in a way that [[a-zA-Z0-9]]* wasn’t for some reason. Can’t say for sure why. I think in the prior example, it was treating the outer [] as () and it matched the first character, the next character, the next _ explicitly, then the next character and then stopped. Since it couldn’t find a letter combination again, it ended the match after “my_i”. Kind of a tricky thing.
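One possible explanation, offered as a sketch rather than a definitive diagnosis: Python's re does not nest character classes, so the inner [ is just a member of the class and the first ] closes it. Under that reading the pattern is "a class containing a literal [, letters and digits", then a literal _, then zero or more literal ] characters, then the final class. The spelled_out pattern below is my own escaped rewrite of that reading, and 'a[_]]]b' is a made-up test string. As far as I know, Oniguruma does treat the inner brackets as a nested class, which would explain why the same text behaves differently in a syntax file.

```python
import re
import warnings

original = r'[A-Za-z][[A-Za-z0-9]_]*[A-Za-z0-9]'
# The same pattern with the suspected literals escaped explicitly.
spelled_out = r'[A-Za-z][\[A-Za-z0-9]_\]*[A-Za-z0-9]'

with warnings.catch_warnings():
    # Newer Pythons warn about a "possible nested set" here.
    warnings.simplefilter('ignore', FutureWarning)
    original_re = re.compile(original)
spelled_out_re = re.compile(spelled_out)

for text in ('my_identifier_9', 'a[_]]]b'):
    a = original_re.search(text)
    b = spelled_out_re.search(text)
    print(repr(text), '->', a.group(), '|', b.group())
# 'my_identifier_9' -> my_i | my_i
# 'a[_]]]b' -> a[_]]]b | a[_]]]b
```

So the * is still greedy; it is just attached to a literal ] rather than to the bracketed group.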

If I check for single letter by itself as an option, then any other alternative has >1 characters and I can force a solid ending conclusion. So the following seems to satisfy most of the conditions except the successive underscores.

p = r'(?i)(?:\b[a-z]\b)|(?:\b[a-z]\w*[a-z0-9]\b)'

Kind of gross looking but so far it’s my best candidate.
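A small check of that candidate against a few hypothetical test names (the list is invented for this sketch) shows which rules it enforces and the one it misses:

```python
import re

candidate = re.compile(r'(?i)(?:\b[a-z]\b)|(?:\b[a-z]\w*[a-z0-9]\b)')

cases = ['x', 'my_identifier_9', 'ends_with_', '9lives', 'double__underscore']
for text in cases:
    # fullmatch requires the entire string to be a valid identifier.
    print(f'{text!r}: {bool(candidate.fullmatch(text))}')
# 'x': True
# 'my_identifier_9': True
# 'ends_with_': False
# '9lives': False
# 'double__underscore': True   <- the remaining gap
```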

#2

I’m kind of puzzled why your first regex only matches my_i too, but I believe this regex satisfies your five requirements:

[a-z]([a-z0-9]|_(?!_))*[a-z0-9]

It should be used case-insensitively. The ([a-z0-9]|_(?!_))* part says: match, zero or more times, one of two alternatives: either [a-z0-9], or an _ that is not followed by another _.

You may also want to enclose it in the usual \b anchors.

To account for single-character identifiers, you could group everything after the first [a-z] and make that group optional with ?.
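Putting those pieces together for testing in Python's re, a minimal sketch could look like this. The names IDENT and is_identifier are hypothetical, the groups are written as non-capturing, and fullmatch stands in for the \b anchors since each test string is a single candidate identifier.

```python
import re

# First letter, then an optional tail that must end in a letter or digit.
# Inside the tail, an underscore is only allowed when the next character
# is not another underscore.
IDENT = re.compile(r'[a-z](?:(?:[a-z0-9]|_(?!_))*[a-z0-9])?', re.IGNORECASE)

def is_identifier(text):
    """Return True if the whole string satisfies the five rules."""
    return IDENT.fullmatch(text) is not None

for text in ('x', 'My_Identifier_9', 'ends_with_', 'double__underscore', '9lives'):
    print(f'{text!r}: {is_identifier(text)}')
# 'x': True
# 'My_Identifier_9': True
# 'ends_with_': False
# 'double__underscore': False
# '9lives': False
```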

#3

I’m pretty sure something like this will work:

[[:alpha:]](?:[[:alnum:]]|_[[:alnum:]])*
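To try this in Python's re, which has no [[:alpha:]] or [[:alnum:]], one option is to spell the classes out. This sketch assumes ASCII-only identifiers, which may be narrower than what the POSIX classes match in Oniguruma; IDENT_ASCII and is_identifier_ascii are names invented for the example.

```python
import re

# [[:alpha:]] -> [A-Za-z], [[:alnum:]] -> [A-Za-z0-9] (ASCII assumption).
# Every underscore must be followed immediately by a letter or digit, which
# rules out both a trailing underscore and two underscores in a row.
IDENT_ASCII = re.compile(r'[A-Za-z](?:[A-Za-z0-9]|_[A-Za-z0-9])*')

def is_identifier_ascii(text):
    """Return True if the whole string matches the translated pattern."""
    return IDENT_ASCII.fullmatch(text) is not None

for text in ('x', 'my_identifier_9', 'ends_with_', 'double__underscore'):
    print(f'{text!r}: {is_identifier_ascii(text)}')
# 'x': True
# 'my_identifier_9': True
# 'ends_with_': False
# 'double__underscore': False
```

Single-character identifiers come for free here because the repeated group can match zero times.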

#4

Both good candidates. I’ll give that a try in console against some of my test cases and see how that works.

Out of curiosity, is there a Python library that specifically supports the Oniguruma RE flavor? I did some web searching but didn’t come up with anything; it kept referring back to the Japanese page for Oniguruma. When I do testing or write modules I use Python’s ‘re’ module, but that’s a specifically Python flavor and I can’t use the :alpha: and :alnum: shortcuts (though I do get the (?P…) variations, which are useful in code). If there were a module I could import that fully implemented Oniguruma, or one that translated patterns from one flavor to the other, then I could do direct testing.

#5

The regex package might support a few more constructs that Oniguruma also does. In any case, it supports \p{Letter} and similar.

I don’t think there is a package that specifically provides Oniguruma though.

FWIW, it’s also available as a Package Control dependency.
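As a rough sketch of what that buys you for console testing: to my knowledge the third-party regex package accepts POSIX bracket classes such as [[:alpha:]], so the pattern from post #3 can be tried unchanged. This assumes regex is installed (pip install regex); the helper matches_posix_pattern is hypothetical and returns None when the package is missing. Note that for str patterns these classes are Unicode-aware there, which may or may not match what the syntax engine does.

```python
try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None

# The pattern from post #3, left exactly as written for Oniguruma.
POSIX_PATTERN = r'[[:alpha:]](?:[[:alnum:]]|_[[:alnum:]])*'

def matches_posix_pattern(text):
    """Return True/False, or None when the regex package is unavailable."""
    if regex is None:
        return None
    return regex.fullmatch(POSIX_PATTERN, text) is not None

for text in ('my_identifier_9', 'double__underscore'):
    print(f'{text!r}: {matches_posix_pattern(text)}')
```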

#6

Beat me to it @Fichtefoll. I was literally typing up that answer and then your post popped up. regex at least handles unicode properties, and adds a lot of other cool stuff.

#7

That’s good to know. I did think about trying to use the set logical operators that Oniguruma supports, but when it came to testing out different patterns I couldn’t actually get from point A to point B with ‘re’. Seems like the [:codeword:] construct is unique to Oniguruma, however. It’s not difficult to get around, and it might be better to pursue the more widely applicable case, since I don’t believe I need to support Unicode for the language. I should check the IEEE committee notes on the upcoming language revision, though, to see if they added Unicode token support.

As it turns out, both rwols’s and djspiewak’s variations work (after altering the latter to use non-Oniguruma character groups). Thanks guys! (I did have to add \b around it, but that seems pretty standard for all sorts of pattern matching. Otherwise it would match an internal portion of a bad identifier like ‘9_bad_identifiers’.) Now on to further study of the context naming and trying to choose the proper taxonomy for the constructs.
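For anyone following along, a short sketch of the \b point using the pattern from post #2 with the optional tail. CORE, unanchored and anchored are names chosen for this example.

```python
import re

CORE = r'[a-z](?:(?:[a-z0-9]|_(?!_))*[a-z0-9])?'
unanchored = re.compile(CORE, re.IGNORECASE)
anchored = re.compile(r'\b' + CORE + r'\b', re.IGNORECASE)

text = '9_bad_identifiers'

# Without \b, search skips the leading '9_' and matches the rest.
print(unanchored.search(text).group())  # bad_identifiers

# With \b, there is no word boundary between '_' and 'b', because both
# count as word characters, so nothing inside the token can match.
print(anchored.search(text))  # None
```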
