Well, I think I may be dense here, but I haven’t been able to figure out a way to do this with the RE language. So a valid identifier has the following rules:
- Case insensitive
- Must begin with an alphabetic character.
- Characters allowed are alphanumeric characters plus underscore.
- May not end in an underscore.
- May not contain two or more successive underscores.
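To have something to check candidate patterns against, the rules above can be restated without any regex at all. This is only a sketch: the helper name follows_rules is made up for illustration, and it assumes ASCII letters and digits, which is what the [A-Za-z0-9] classes used further down imply.

```python
import string

LETTERS = set(string.ascii_letters)            # both cases, so the check is case insensitive
ALLOWED = LETTERS | set(string.digits) | {"_"}


def follows_rules(name: str) -> bool:
    """Hypothetical reference check that restates the rules directly."""
    if not name or name[0] not in LETTERS:     # must begin with an alphabetic character
        return False
    if any(ch not in ALLOWED for ch in name):  # alphanumerics plus underscore only
        return False
    if name.endswith("_"):                     # may not end in an underscore
        return False
    return "__" not in name                    # no two or more successive underscores
```

Any regex candidate can then be compared against follows_rules on a list of good and bad names.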
So, I’ve been tinkering with the following pattern in the Python console:
[A-Za-z][[A-Za-z0-9]_]*[A-Za-z0-9]
In the syntax file I can use :alpha: and :alnum: because those appear to be defined by Oniguruma REs, but I think the standard Python re module does not support them. Anyhow, that pattern captures the validity of the first character, the validity of the last character, and one of the rules for the interior. However, I have run into a problem: when testing it, the * doesn’t seem to be greedy like it should be.
>>> import re
>>> pattern = r'[A-Za-z][[A-Za-z0-9]_]*[A-Za-z0-9]'
>>> text = "my_identifier_9"
>>> s = re.search(pattern, text)
>>> s.group()
'my_i'
So, the problems:
1. The * doesn’t seem to be greedy like it should be. It should have tried to grab everything from y to the second _. I can’t say I’ve ever seen * not be super greedy.
2. It doesn’t take into consideration a single-character identifier, though I think that can be solved by adding a * after the final group to make it optional.
3. I haven’t the foggiest notion how to exclude successive underscores. I could potentially use a negative lookahead (?!__), but I’m not entirely sure how to combine that with the rest of it, since that * is supposed to be greedy and I need to check every instance of underscore. That is to say, I think
[a-zA-Z](?!__)[[a-zA-Z0-9]_]*[a-zA-Z]*
would check for a __ after the first character but would permit double underscores after that, since the interior group will be slurping up character by character. (I added the * at the end because I think that solves #2.)
4. Would it be clearer to use (?i) at the beginning and just check for a-z throughout? I see (?i) used a lot in the syntax files because clearly you can’t use re.IGNORECASE as a flag as you can in direct Python RE. I guess this is more of a common-practice question.
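On the question of checking every underscore with a lookahead: one possible arrangement is to attach the lookahead to the underscore itself, inside the repeated group, so it is evaluated each time an underscore is consumed rather than once after the first character. The sketch below is an assumption about how that could look in Python’s re. The helper name passes is hypothetical, and \Z here has Python’s meaning (end of string only), which I have not checked against the engine used by the syntax file.

```python
import re

# Assumption: the lookahead sits directly after each underscore, so it is
# re-evaluated for every underscore the repeated group consumes.
# _(?!_|\Z) reads "an underscore that is not followed by another underscore
# or by the end of the string".
per_underscore = re.compile(r'(?i)[a-z](?:[a-z0-9]|_(?!_|\Z))*')


def passes(name: str) -> bool:
    """Hypothetical helper: True only if the whole string matches."""
    return per_underscore.fullmatch(name) is not None
```

Because the trailing-underscore case is handled by the same lookahead, a single letter is accepted with zero repetitions of the group.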
I suppose #1 and #3 are the biggest questions. I really can’t figure out why * isn’t greedy and how to avoid multiple _'s.
Though perhaps the inner group could be [[a-zA-Z0-9]_?]* but I’m not entirely sure if that’s valid. Seems vaguely sketchy. I could test it at console, except the searcher is not being properly greedy.
Actually it’s worse than I thought with the final character. I tried the following:
pattern = r'(?i)[a-z]\w*[a-z0-9]?'
That actually matches all the bad cases. Since the ? makes the final class optional and \w also matches _, it’ll match ‘another_bad_ident_’ quite happily.
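A small console-style check of that behaviour, plus what happens if the ? is simply dropped while the pattern stays unanchored. The variable names are only for illustration.

```python
import re

name = "another_bad_ident_"

# Optional tail: \w* greedily takes everything (including the trailing _),
# and the final class is then allowed to match nothing.
optional_tail = re.search(r'(?i)[a-z]\w*[a-z0-9]?', name).group()

# Required tail, but still unanchored: search() backtracks and returns a
# valid-looking prefix instead of rejecting the string.
required_tail = re.search(r'(?i)[a-z]\w*[a-z0-9]', name).group()

print(optional_tail)   # another_bad_ident_
print(required_tail)   # another_bad_ident
```

So making the last character mandatory is not enough on its own; the match also has to be pinned to the end of the identifier somehow.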
Using \w* seems to be very greedy in a way that [[a-zA-Z0-9]]* wasn’t for some reason. Can’t say for sure why. I think in the prior example, it was treating the outer [] as () and it matched the first character, the next character, the next _ explicitly, then the next character and then stopped. Since it couldn’t find a letter combination again, it ended the match after “my_i”. Kind of a tricky thing.
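One way to test that guess: as far as I can tell, Python’s re does not nest character classes, so the first ] closes the class and the rest is read literally. The sketch below compares the original pattern with a flattened variant (the flattened form is my assumption about what was intended) and feeds the original an odd string that it should only accept if the ]* really is a run of literal ] characters.

```python
import re
import warnings

nested = r'[A-Za-z][[A-Za-z0-9]_]*[A-Za-z0-9]'
flat = r'[A-Za-z][A-Za-z0-9_]*[A-Za-z0-9]'   # assumption: underscore moved into a single class

with warnings.catch_warnings():
    # Newer Python versions may emit a FutureWarning about a possible nested set here.
    warnings.simplefilter("ignore", FutureWarning)
    nested_re = re.compile(nested)

# If the brackets do not nest, the pattern reads as: a letter, one character from
# the class "[A-Za-z0-9" (which includes a literal "["), a literal "_",
# zero or more literal "]", then one alphanumeric.
odd_match = nested_re.fullmatch("a[_]]]b")

nested_result = nested_re.search("my_identifier_9").group()
flat_result = re.search(flat, "my_identifier_9").group()

print(odd_match is not None)   # True
print(nested_result)           # my_i
print(flat_result)             # my_identifier_9
```

If that reading is right, the * was greedy all along; it was just attached to a literal ] rather than to the group.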
If I check for a single letter by itself as one alternative, then the other alternative always has more than one character, and I can force a valid final character. So the following seems to satisfy all of the conditions except the one about successive underscores.
p = r'(?i)(?:\b[a-z]\b)|(?:\b[a-z]\w*[a-z0-9]\b)'
Kind of gross looking but so far it’s my best candidate.
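One direction that might fold the underscore rule in without any lookaround: treat every character after the first as an alphanumeric that may be preceded by at most one underscore. This is a sketch under that assumption, not something tested in a syntax file. The names candidate, in_text and is_valid_identifier are hypothetical.

```python
import re

# Assumption: every character after the first is an alphanumeric that may be
# preceded by at most one underscore. That shape leaves no room for "__" or a
# trailing "_", and a single letter matches with zero repetitions.
candidate = re.compile(r'(?i)[a-z](?:_?[a-z0-9])*')

# Same idea wrapped in \b for picking identifiers out of longer text.
in_text = re.compile(r'(?i)\b[a-z](?:_?[a-z0-9])*\b')


def is_valid_identifier(name: str) -> bool:
    """Hypothetical helper: whole-string check with fullmatch."""
    return candidate.fullmatch(name) is not None


found = in_text.findall("x = my_identifier_9 + bad_ + a__b")
print(found)   # ['x', 'my_identifier_9']
```

The \b version relies on _ counting as a word character, so a name followed or interrupted by a stray underscore fails the boundary test instead of matching a shorter prefix.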