Sublime Forum

Saved Captures, Multiple Pops

#1

Both of these suggestions arise from trying to deal with verbose languages with prologue sections and statement sections in large language constructs.

Suggestion 1. If there were a syntax file construct that saved the results of a capture match into a variable, that could be very handy. Handling large body constructs seems to take three contexts: the initial structure-matching context, the declarative context, which then sets the statements context, which ends at the end of the structure. This makes matching the original identifier impossible because the regexp match from the first context has been lost by then. If the match could be preserved across multiple contexts, I could check for it. This would help anywhere a matching word needs to be found through multiple levels of pushed or set contexts.

Suggestion 2. Allow pop to take an integer argument that specifies the number of contexts to pop. Using anonymous contexts with push is attractive because it keeps contexts tidy and doesn’t clutter the namespace. However, you can’t do multi-section structures without set, and you can’t use set with anonymous contexts. If you could pop twice, you could handle tiered constructs with anonymous contexts and push (and perhaps clear_scopes).

EDIT: Having now done this several times, I can confirm it is actually possible to use set with anonymous contexts. However, I still think the ability to pop more than once could be useful.
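
As a minimal sketch of set with anonymous contexts: the fragment below nests an anonymous statements context inside an anonymous declarative context. The patterns and scope names are illustrative assumptions, not taken from the actual VHDL syntax, and the fragment omits the file header a complete .sublime-syntax needs.

```yaml
contexts:
  main:
    - match: '(?i)\barchitecture\b'
      scope: storage.type.architecture.vhdl
      push:
        # Anonymous declarative (prologue) context
        - meta_scope: meta.block.arch-declarations.vhdl
        - match: '(?i)\bbegin\b'
          scope: keyword.declaration.vhdl
          set:
            # Anonymous statements context; 'set' replaces the prologue
            # on the stack, so a single pop returns to 'main'
            - meta_scope: meta.block.arch-statements.vhdl
            - match: '(?i)\bend\b'
              scope: keyword.declaration.vhdl
              pop: true
```

Because set swaps the prologue for the statements context instead of stacking on top of it, the stack never gets more than one level deep here and the single pop is enough.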

#2

There is a rather hacky way (though not entirely in a bad way) to do multiple pops: give both the top context and the one under it a pop pattern that is a short lookahead. The parser then pops without consuming any tokens. The top context pops off, the one under it runs the same lookahead, matches again, and pops off too.
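
A rough sketch of that lookahead trick, assuming a made-up block/begin/end construct (context names, keywords, and scopes are all hypothetical):

```yaml
contexts:
  main:
    - match: '\bblock\b'
      scope: keyword.control.example
      push: prologue
    # Consumes the terminator once both contexts below have popped
    - match: '\bend\b'
      scope: keyword.control.example

  prologue:
    - match: '\bbegin\b'
      scope: keyword.control.example
      push: body
    # Lookahead only: pops without consuming 'end'
    - match: '(?=\bend\b)'
      pop: true

  body:
    # Same lookahead as in 'prologue', so the pop cascades
    - match: '(?=\bend\b)'
      pop: true
```

When the parser reaches end inside body, body pops, prologue sees the same unconsumed end and pops as well, and main finally consumes and scopes the keyword.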

I agree that “multi-popping” could be useful, but I fear that those kinds of contexts are not very reusable. They’ll probably need to work together with other contexts beneath them in order to not accidentally pop too much.

#3

there is a feature request for this here:

if you have a match pattern that pushes a context, the capture groups are available in the match patterns that will pop the context - otherwise it wouldn’t be possible to handle heredocs
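
a minimal sketch of the heredoc case, assuming a shell-like <<DELIMITER syntax (the scope name is made up):

```yaml
contexts:
  main:
    # Group 1 captures the heredoc delimiter, e.g. EOT in <<EOT
    - match: '<<(\w+)'
      push:
        - meta_scope: string.unquoted.heredoc.example
        # \1 refers to group 1 of the pushing match above,
        # not to a group in this pattern
        - match: '^\1\b'
          pop: true
```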

#4

So, what are the rules for captures, when they are created, and destroyed, how their scope is handled, etc? Do they persist through a push and then a set afterwards? What happens if you are matching other minor stuff inside a construct – does it overwrite the numbered capture?

#5

nope, which makes some things extremely difficult to lex at the moment

you shouldn’t use backreferences in syntax definitions anyway (as they reduce performance - see the Syntax Tests - Regex Compatibility variant), but you’ll find you can’t use backreferences in pop patterns (at least where there was a capture in the corresponding push pattern) because the backreferences get expanded to the captured text from the push

#6

Yeah, I’ve had to give up matching identifiers on most things because I want greater scope identification and control for this language. The TextMate syntax definition is alright in general, and Brian Padolino put a lot of work into it. It does match the trailing identifier in many instances, but that makes for very complex patterns. My rewrite scopes things with more granularity, but then I lose out on the identifier match.

I don’t know that I follow your final comment though because I have seen this work in Padolino’s syntax. For example, the language has the following construct:

architecture <valid_identifier_for_architecture_name> of <valid_identifier_for_entity> is
    <block declaration_pattern>
begin
    <concurrent statements>
end [architecture] [<valid_identifier_for_architecture>];

The Padolino construct looks like this (after being translated into YAML – it was originally the TextMate XML format):

  architecture_pattern:
    - match: |-
        (?x)

                    # The word architecture $1
                    \b((?i:architecture))\s+

                    # Followed up by a valid $3 or invalid identifier $4
                    (([a-zA-Z][a-zA-Z0-9_]*)|(.+))(?=\s)\s+

                    # The word of $5
                    ((?i:of))\s+

                    # Followed by a valid $7 or invalid identifier $8
                    (([a-zA-Z][a-zA-Z0-9_]*)|(.+?))(?=\s*(?i:is))\b

      captures:
        1: keyword.language.vhdl
        3: entity.name.type.architecture.begin.vhdl
        4: invalid.illegal.invalid.identifier.vhdl
        5: keyword.language.vhdl
        7: entity.name.type.entity.reference.vhdl
        8: invalid.illegal.invalid.identifier.vhdl
      push:
        - meta_scope: meta.block.architecture
        - match: |-
            (?x)
                        # The word end $1
                        \b((?i:end))

                        # Optional word architecture $3
                        (\s+((?i:architecture)))?

                        # Optional same identifier $6 or illegal identifier $7
                        (\s+((\3)|(.+?)))?

                        # This will cause the previous to capture until just before the ; or $
                        (?=\s*;)

          captures:
            1: keyword.language.vhdl
            3: keyword.language.vhdl
            6: entity.name.type.architecture.end.vhdl
            7: invalid.illegal.mismatched.identifier.vhdl
          pop: true
... (there are a lot of - includes: after this for various lexical structures)

If I’m reading your line correctly, this shouldn’t work because \3 would be referencing the 3 from the pushed match. However, it does seem to work, though there are no different scopes for the prologue block and the statements block, which means technically it’ll match a lot of things in places it shouldn’t.

My variation is somewhat simpler, but all it does is check for valid identifier construction; it cannot ensure that the closing identifier matches. I did it with three named contexts, but that was the first one I’d done with a prologue, and I’ve subsequently done some others with anonymous contexts. Anyhow, the whole point of the feature suggestion was that it’d be nice to be able to accomplish both tasks: greater granularity in scoping the lexical elements, and feedback to the author if they write an invalid closing construct.

  # The architecture needs three contexts to get the scopes right
  # for the various structures.
  architecture-begin:
    - match: '(?i)^\s*(architecture)\s+({{identifier}})\s+(of)\s+({{identifier}})\s+(is)'
      captures:
        1: storage.type.architecture.vhdl
        2: entity.name.architecture.vhdl
        3: keyword.other.vhdl
        4: entity.name.entity.vhdl
        5: keyword.declaration.vhdl
      push: architecture-declarations

  # Note: 'set' is used because a push here would leave us two deep in the
  # stack, and I cannot pop twice.
  architecture-declarations:
    - meta_scope: meta.block.arch-declarations.vhdl
    - include: block-declarative-items
    - match: '(?i)\b(begin)\b'
      captures:
        1: keyword.declaration.vhdl
      set: architecture-statements

  architecture-statements:
    - meta_scope: meta.block.arch-statements.vhdl
    - include: concurrent-statements
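    # Each optional part is grouped with its leading whitespace so that
    # 'end;' and 'end <name>;' match as well as the full form.
    - match: '(?i)^\s*(end)(?:\s+(architecture))?(?:\s+({{identifier}}))?\s*(;)'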
      captures:
        1: keyword.declaration.vhdl
        2: storage.type.architecture.vhdl
        3: entity.name.architecture.vhdl
        4: punctuation.terminator.vhdl
      pop: true

#7

I wonder if it would be possible to sensibly specify and implement this behavior for the new regexp system. My guess is no, because you’d have to dynamically compile regexps while parsing. You could cache them, but then you’d still be running multiple regexps per context. Plus, it would be hard to avoid parsing the entire remainder of the file when you have dynamic contexts like this.
