Sublime Forum

A Little Syntax Regex Issue

#1

I’ve been trying to improve my syntax file to cover a particular bug. Normally a token definition looks like signal foo : std_logic; and I capture that fine. However it’s possible to have more than one item on a line, so signal foo, bar : std_logic; is valid. I don’t handle that properly. I have a variable called identifier that contains the regex for a proper identifier construction.

Here’s what I have that’s close to working:

    - match: |-
        (?xi)
          ((signal)\s*)
          ({{identifier}})
          (\s*(,)\s*({{identifier}}))*
          (\s*(:))
      captures:
        2: storage.type.signal.vhdl
        3: variable.other.vhdl
        5: punctuation.separator.vhdl
        6: variable.other.vhdl
        8: punctuation.separator.vhdl

The logic is that it finds the keyword and the first (mandatory) identifier. A repeated group then matches a comma followed by a further identifier, and that group may match zero times. Finally it finds the colon separator, which is where I push a context that looks for the terminating semicolon. This looks like it’s matching the lexical elements pretty well until I drill down into the scopes.

signal alpha, beta, gamma, delta : std_logic;
^-- storage.type.signal.vhdl -- Correct
       ^-- variable.other.vhdl -- Correct
            ^-----------^ -- Incorrect.  Matches the line however does not match any captured group
                         ^-- punctuation.separator.vhdl -- Correct
                           ^-- variable.other.vhdl -- Correct
                                 ^-- punctuation.separator.vhdl -- Correct

So note the bit in the middle. The line as a whole is matched, and the first and last variables are scoped correctly, but the names in the middle don’t get a scope from any capture group. I did some poking around and this seems pretty similar to the topic “Syntax highlight capture scopes should apply to all captures of a given group”.
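The same effect can be reproduced outside Sublime. Here is a minimal sketch using Python’s built-in re module, which is not the engine Sublime uses, so it only illustrates the general behaviour. The IDENT pattern is an assumed, simplified stand-in for the {{identifier}} variable; the group layout mirrors the match above.

```python
import re

# Assumption: a simplified stand-in for the {{identifier}} variable.
IDENT = r"[A-Za-z][A-Za-z0-9_]*"

# Same group layout as the match above:
# 2 = keyword, 3 = first name, 5 = comma, 6 = further name, 8 = colon.
PATTERN = re.compile(
    rf"""
    ((signal)\s*)
    ({IDENT})
    (\s*(,)\s*({IDENT}))*
    (\s*(:))
    """,
    re.VERBOSE | re.IGNORECASE,
)

line = "signal alpha, beta, gamma, delta : std_logic;"
m = PATTERN.match(line)

# Text and span reported for each numbered group of interest.
groups = {n: (m.group(n), m.span(n)) for n in (2, 3, 5, 6, 8)}
for n, (text, span) in groups.items():
    print(n, repr(text), span)
```

Groups 5 and 6 report only the final comma and delta, so beta and gamma are part of the overall match but belong to no reported group, which lines up with the scope test above.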

Am I running into a known behavior here with scopes not working well on repetitive capture groups?

If I am, then that’s going to push me into the more elaborate matching where I look for signal and then push and set until I hit the terminating semicolon.

#2

Not sure if this can be the problem, but instead of making a big regex that captures them all, you could split the patterns and use:

push:
  - include: a
  - include: b
  ...

It will make it easier to make regexes that work, and correct the ones that don’t work.

#3

yes, you are

#4

Yeah I’m experimenting with the “serial” method right now. And just saw Keith’s answer so I suspect I’ll have to pursue this more strongly. Thanks!

#5

My understanding is that regex engines generally (if not always?) keep only the last value of a capture group when the group is repeated. https://www.regular-expressions.info/captureall.html
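A small sketch of that behaviour, again with Python’s built-in re module as an assumed stand-in (other engines may differ). It also shows the usual workaround of capturing the whole repeated section once and splitting it in a second pass; the sample text and variable names are only for illustration.

```python
import re

text = "alpha, beta, gamma"

# Repeated group: only the final repetition is retained in group 1.
repeated = re.fullmatch(r"(?:(\w+)(?:,\s*)?)+", text)
last_only = repeated.group(1)

# Workaround: capture the whole list once, then split it in a second pass.
whole = re.fullmatch(r"((?:\w+(?:,\s*)?)+)", text)
names = re.findall(r"\w+", whole.group(1))

print(last_only)
print(names)
```

A syntax definition has no second pass over a capture, which is why the advice in this thread is to match one token at a time instead.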

#6

I would say generally, but not ‘always’. For instance, the 3rd party Regex module actually does retain multiple captures, but I am not sure if it is based on some other regular expression engine or if it is unique. For that reason that module has a format template replace mode where you can access the different captures, e.g. My replace{3[4]}. So yeah, I think it’s a good general statement.

#7

Do you happen to know how Sublime’s regex engine handles captures? I suspect that it uses something like Laurikari’s tagged automata, which would make handling duplicate captures impractical.

In any case, mg979 is right that this sort of thing is best handled by adding more contexts:

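main:
  - match: \bsignal\b
    scope: storage.type.signal.vhdl
    # Contexts are pushed in order, so signal-names ends up on top and runs first.
    push:
      - signal-definition
      - signal-names

signal-names:
  - match: '{{identifier}}'
    scope: variable.other.vhdl
  - match: ','
    scope: punctuation.separator.list.vhdl
  - match: ':'
    scope: punctuation.separator.key-value.vhdl
    pop: true
  # Anything else ends the name list and hands control back to signal-definition.
  - match: (?=\S)
    pop: true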

This should handle any number of signal names, and it should also work over newlines. (Disclaimer: I’m on vacation, this is my Chromebook, and I haven’t tested the code.)

#8

I’m pretty sure that .NET lets you access multiple matches from capture groups

#9

Yeah, as soon as I knew I was in the same territory as repeated captures, I changed course into that more ‘tokenizing’ approach.

  signal-decl:
    # Word boundaries keep this from matching inside longer identifiers.
    - match: (?i)\bsignal\b
      scope: storage.type.signal.vhdl
      push:
        - meta_scope: meta.statement.signal.vhdl
        - match: ({{identifier}})
          scope: variable.other.vhdl
        - match: (,)
          scope: punctuation.separator.vhdl
        # The colon ends the name list; set swaps in the context for the type.
        - match: (:)
          scope: punctuation.separator.vhdl
          set:
            - match: (;)
              scope: punctuation.terminator.vhdl
              pop: true
            - include: basic-paren-group
            - include: stray-parens
            - include: reserved-words

This seems to work out alright, and it extended well to constant and variable declarations. Honestly, I’ll probably end up going this way throughout, but the file is pretty big at this point, so it’ll be a gradual process as I find weaknesses in the matching design.
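For anyone who wants to see the idea in isolation, here is a rough Python sketch of the same ‘tokenizing’ approach: each context is a short list of single-token rules, and a rule can switch to another context. This is only an analogy for how the contexts above behave, not how Sublime implements them. IDENT, CONTEXTS and tokenize are hypothetical names, and IDENT is an assumed, simplified stand-in for {{identifier}}.

```python
import re

# Assumption: a simplified stand-in for the {{identifier}} variable.
IDENT = r"[A-Za-z][A-Za-z0-9_]*"

# One rule list per context: (regex, scope, next context or None to stay put).
CONTEXTS = {
    "main": [
        (re.compile(r"\bsignal\b", re.I), "storage.type.signal.vhdl", "names"),
    ],
    "names": [
        (re.compile(IDENT), "variable.other.vhdl", None),
        (re.compile(r","), "punctuation.separator.vhdl", None),
        (re.compile(r":"), "punctuation.separator.vhdl", "type"),
    ],
    "type": [
        (re.compile(r";"), "punctuation.terminator.vhdl", "main"),
    ],
}


def tokenize(line):
    """Return (text, scope) pairs, switching context as rules fire."""
    tokens, context, pos = [], "main", 0
    while pos < len(line):
        for pattern, scope, next_context in CONTEXTS[context]:
            m = pattern.match(line, pos)
            if m:
                tokens.append((m.group(), scope))
                pos = m.end()
                context = next_context or context
                break
        else:
            pos += 1  # no rule matched here: skip one character unscoped
    return tokens


tokens = tokenize("signal alpha, beta, gamma, delta : std_logic;")
for text, scope in tokens:
    print(repr(text), scope)
```

Because every name is matched by its own rule application, each one gets its scope no matter how many appear in the list.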
