Sublime Forum

Syntax Definitions: How to force pop out of the stack?

#1

While working on syntax definitions, I often stumble upon a problem for which I haven’t managed to find a clean solution: when the last element is optional, how can I force a pop out of the stack?

For example, I am working on a syntax that ends all its (meta) blocks with the END keyword, which may be followed by an optional identifier and the optional . terminator:

# ... some context in the stack
  - match: (?i)\bEND\b
    scope: keyword.control.alan
  - match: '\.'
    scope: punctuation.terminator.alan
    pop: true

Currently, the only workaround that came to mind is this one:

  - match: (?i)\bEND\b
    scope: keyword.control.alan
  - match: '\.|$'
    scope: punctuation.terminator.alan
    pop: true

… which is rather ugly, as it presumes that there will be no other keywords/statements on the same line (which is not an absolute certainty, for the language ignores whitespace, including EOLs); also, it creates a content-less scope for a non-existent punctuation.terminator (so, it’s doubly ugly).

Besides the above example, this is a problem I often face, i.e., the need to force-pop my way out of a stacked context. What is the proper way to go about it? This syntax raises an error:

  - pop: true

… so, obviously pop can’t stand on its own but must be backed by some match statement. I guess the solution will be some RegEx pattern that always matches with zero length. Something like:

  - match: .?
    pop: true

Any advice on this? What would be a clean approach/solution to the problem?

0 Likes

#2

an empty match pattern will do the trick

- match: ''
  pop: true

this is commonly used in the syntaxes that ship with ST.

3 Likes

#3

Try something like this:

contexts:
  else-pop:
    - match: (?=\S)
      pop: true

  some-context:
    - match: (?i)\bEND\b
      scope: keyword.control.alan
      set:
        - period
        - end-identifier

  period:
    - match: \.
      scope: punctuation.terminator.alan
      pop: true
    - include: else-pop

  end-identifier:
    - match: '{{identifier}}'
      scope: whatever
      pop: true
    - include: else-pop

When some-context encounters END, it will pop some-context and push two other contexts on the stack. The topmost of those will match an identifier, and the next will match the period. If either is missing, that context will pop itself. else-pop is a utility context I use in virtually all of my syntaxes.

This is the method I always use and recommend for handling sequences like the one you describe. The components are automatically optional, it works over newlines, and it fails very gracefully.

0 Likes

#4

I’m running in circles; something is going wrong and I either lose scope prematurely or don’t manage to close it.

The problem seems to be related to when I use multiple set or push at once. My understanding is that each context is pushed or set, and when it pops the next one is pushed/set. Is this right?

I ask because the problem is either there or due to some unconsumed whitespace that is left over after popping and prevents correctly exiting some context.

The situation is getting complicated because I’m trying to create reusable contexts for class identifiers: they will also be used in instances, and since they are quite long (each having a quoted and an unquoted version) it would be ideal to have them as reusable contexts, especially if I need to change them (which I might).

This is the current status of the code (which doesn’t end the meta scope after the terminator):

  class:
    - match: (?i)^\s*(EVERY)\b
      scope: storage.type.class.alan
      set: [class_body, class_identifier]
      # set: [any_POP, class_body, class_identifier]
    - include: immediate_POP

  
  class_body:
    - meta_scope: meta.class.alan
    - match: (?i)\bIsA\b
      scope: storage.modifier.extends
      push: inherited_class_identifier
    - match: (?i)\bEND\s+EVERY\b
      scope: keyword.control.alan
      push:
      - optional_terminator
      - optional_class_tail_identifier

  optional_terminator:
  - match: '\.'
    scope: punctuation.terminator.alan
    pop: true
  - include: any_POP

  any_POP:
    - match: '(?=\S)'
      pop: true

  immediate_POP:
    - match: ''
      pop: true

  class_identifier:
    # Unquoted ID
    - match: '{{ID}}'
      scope: entity.name.class.alan
      pop: true
    # Quoted ID
    - match: "(')((?:''|[^'])*)(')"
      captures:
        1: punctuation.definition.string.begin.alan
        2: entity.name.class.alan
        3: punctuation.definition.string.end.alan
      pop: true

  optional_class_tail_identifier:
    # Unquoted ID
    - match: '{{ID}}'
      scope: entity.name.class.tail.alan
      pop: true
    # Quoted ID
    - match: "(')((?:''|[^'])*)(')"
      captures:
        1: punctuation.definition.string.begin.alan
        2: entity.name.class.tail.alan
        3: punctuation.definition.string.end.alan
      pop: true
    # - include: immediate_POP
    - match: ''
      pop: true

  inherited_class_identifier:
    # Unquoted ID
    - match: '{{ID}}'
      scope: entity.other.inherited-class.alan
      pop: true
    # Quoted ID
    - match: "(')((?:''|[^'])*)(')"
      captures:
        1: punctuation.definition.string.begin.alan
        2: entity.other.inherited-class.alan
        3: punctuation.definition.string.end.alan
      pop: true

Everything is captured and scoped as expected … but after the terminal . the source.alan meta.class.alan scope is still active.

0 Likes

#5

Most problems that come up when using the stack can be solved by using the stack even more. One thing I typically do is confine meta scopes to single-purpose contexts. Example:

  else-pop:
    - match: (?=\S)
      pop: true

  immediately-pop:
    - match: ''
      pop: true

  meta-class:
    - meta_scope: meta.class.alan
    - include: immediately-pop

  main:
    - match: (?i)\bEVERY\b
      scope: storage.type.class.alan
      push:
        - meta-class
        - class-body
        - inheritance-clause
        - class-name

  class-name:
    - match: (')(.*?)(')
      scope: meta.string.alan
      captures:
        1: string.quoted.single.alan punctuation.definition.string.begin.alan
        2: entity.name.class.alan
        3: string.quoted.single.alan punctuation.definition.string.end.alan
      pop: true
    - include: else-pop

  class-body:
    - match: (?i)\bEND\b
      scope: keyword.control.alan
      set:
        - optional_terminator
        - optional_class_tail_identifier
        - end-every-keyword

2 Likes

#6

I’ve finally managed to achieve the goal. It wasn’t simple (actually, a nightmare), but thanks to your advice I’ve managed to accomplish it.

The problem was indeed tied to unconsumed whitespace: the else_POP and immediate_POP tricks to force your way out of a stacked context can cause a premature pop if there is still some whitespace floating around which isn’t captured by the various included contexts. In the end, I had to ensure that the pattern that sets the context on the stack also eats up (and discards) any trailing whitespace.

Also, another problem was the END EVERY construct having two optional trailing keywords (identifier and dot-terminator). To prevent loose scopes floating about, I had to implement an extra check for an END EVERY statement followed by only whitespace (ie: neither ID nor terminator).

Here I publish the final code of how I’ve managed it (just in case it might turn out a useful reference to others with similar problems):

  class:
    - match: (?i)\bEVERY\b
      scope: storage.type.class.alan
      set: [class_body, class_identifier]
  
  class_body:
    - meta_scope: meta.class.alan
    - include: class_head
    - include: class_tail

  class_head:
    - match: (?i)\bIsA\b
      scope: storage.modifier.extends
      push: inherited_class_identifier
    # TODO: inheritance

  class_tail:
    # ===========================
    # END EVERY => no ID & no `.`
    # ===========================
    # When END EVERY is followed by neither ID nor dot, we must capture it
    # separately to avoid stray scopes after it...
    - match: (?i)\b(END\s*EVERY)\b(?=\s*$)
      captures:
        1: keyword.control.alan
      pop: true
    # ==========================
    # END EVERY => ID and/or `.`
    # ==========================
    - match: (?i)\b(END\s*EVERY)\b\s* # <= must consume all whitespace!
      captures:
        1: keyword.control.alan
      set:
        - meta_content_scope: meta.class.alan
        - include: class_tail_identifier
        - include: terminator
        - include: force_POP

  terminator:
    - match: '\.'
      scope: punctuation.terminator.alan

The lessons I’ve learned from tackling this problem are:

  • Always beware of any leftover whitespace from regex match patterns:
    • Try to consume trailing whitespace by capturing it with a discarded group
    • captures is better than scope because it allows adding extra discarded groups for testing whether leading/trailing whitespace is a problem
  • Lookaheads are your best friends when it comes to handling optional syntax elements at the end of a meta scope
  • While working on a syntax’s context:
    • Annotate the stack level in side comments (it’s so easy to lose track of how deep in the stack each context and its included statements are)
    • For reusable syntax-element contexts, consider creating both a popless version and another one that pops out of the stack (sometimes you might be including them, other times pushing/setting them)
  • Lacking a context-stack debugger:
    • Add some arbitrary label to scopes in order to be able to track in the highlighted code which context is active (when there are variants). E.g., by using keyword.control.NONE.alan and keyword.control.IdOrDOT.alan I was able to uncover the leftover whitespace problem which was causing the wrong context to be used.
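
As a rough illustration of the last two groups of tips, here is a minimal sketch (not the actual Alan syntax) of one reusable element in both a popless and a popping variant, with a temporary debug label on one of the scopes. The context name terminator-pop and the POPPING label are made up for this example; terminator and the punctuation.terminator.alan scope come from the code above.

```yaml
contexts:
  # Popless variant: meant to be used with `include` inside a context
  # that has to stay on the stack after the dot.
  terminator:
    - match: \.
      scope: punctuation.terminator.alan

  # Popping variant (hypothetical name): meant to be used with `push` or `set`.
  # It scopes the dot if present, otherwise it leaves at the next
  # non-whitespace token without consuming it.
  terminator-pop:
    - match: \.
      # Temporary label (hypothetical) that shows in the scope popup which
      # variant matched; remove it once the stack behaves as expected.
      scope: punctuation.terminator.POPPING.alan
      pop: true
    - match: (?=\S)
      pop: true
```

Keeping the two variants side by side makes it explicit, at each point of use, whether the element is expected to leave the stack or not.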

I’ve learned these small lessons the hard way, by running in circles for hours because I wasn’t tracking the stack levels correctly in my head. Also, I struggled a lot with include vs push vs set choices, trying to adapt the context to my own liking and pre-existing reusables, which turned out to be a very bad approach.

When starting to deal with lots of reusable contexts and nested contexts, it can quickly become a complex task to keep a clear mental picture of what is actually going on at the parser level. Unfortunately, we can’t escape the unpleasant task of having to mentally track what the RegEx patterns are capturing, consuming and discarding, and how the various stacked contexts loop until they pop out.

If ST had a way to expose to the user the syntax parser’s stack state, its queues, and some debug info about the text being processed, the regex matches and failures, it would be much easier to trace where our custom syntaxes fail.

Any chance that (somewhere in the future) ST might also ship with a command line tool to debug syntax definitions? A console app that takes a syntax file and a source test file as input and spits out two files: a scoped version of the source file (an XML-like doc tree) and a log file listing all the inner workings of the parser engine. This would be an invaluable tool both to learn how to build syntaxes and to fix problems.

1 Like

#7

please consider contributing your tips to

it’s for this reason that I’ve been using https://github.com/trishume/syntect, specifically the syntest example with the --debug argument. Having native support for this in ST would be invaluable :wink:

2 Likes

#8

What about the following?

EVERY foo
END EVERY
foo
.

In this example, shouldn’t foo and the trailing dot be scoped? I think that your example will get this wrong.

If ST had a way to expose to the user the syntax parser’s stack state, its queues, and some debug info about the text being processed, the regex matches and failures, it would be much easier to trace where our custom syntaxes fail.

Any chance that (somewhere in the future) ST might also ship with a command line tool to debug syntax definitions? A console app that takes a syntax file and a source test file as input and spits out two files: a scoped version of the source file (an XML-like doc tree) and a log file listing all the inner workings of the parser engine. This would be an invaluable tool both to learn how to build syntaxes and to fix problems.

I’ve thought about an interactive web-based debugger. The biggest obstacle is the lack of a really good Oniguruma implementation in JavaScript. I suppose that a native-Sublime debugger could be useful as well, but you’d run into that same problem.

0 Likes

#9

You’re right about this:

EVERY foo
END EVERY
foo
.

Right now, it doesn’t get correctly highlighted — well, actually it does, except for the trailing optional part which is left out of the class meta scope. Let’s say it’s not a big deal for the alpha stage.

I’ve written quite a few syntaxes for other highlighters in the past, and my experience has been that some edge cases related to newline whitespace can’t be caught except by going to great lengths to cover them, often ending up with a regex-expensive syntax. The problem is that in theory there could even be 500 lines with 100 space characters each, and it would still be valid syntax, but catching it with a regex would most likely start to hurt performance. Even if I leave out the optional ID and dot, as far as syntax highlighting goes, the class is fully scoped, and the leftover part will be unscoped garbage.

Still, your point is right and I’ve duly noted it down. I’m just postponing edge cases for the time being, in favour of implementing more syntax constructs.

Alan being a language for text adventures, where code and prose blend together, and a syntax which tries to hide the code nature as much as it can, it’s going to be challenging in many parts. Being whitespace insensitive, and because it uses almost no punctuation symbols at all (brackets, commas, etc.), the challenges ahead are plentiful — especially for blocks of statements that don’t even use the END keyword. I don’t even know if I’m going to be able to cover them all properly, but I’ll try.

The expectation is that most users will avoid syntax like the above: they’re going to use the optional ID only if the block is really long, in order to understand what its end part stands for, and hopefully if they use the ID and the dot they will be on the same line. But as I state on the project’s page, currently there are no plans to make it a full-fledged syntax package; it’s more of a personal project which I’m making available to others “as is”. So, for now I have some freedom to enforce opinionated code styling and ignore unorthodox code styling. Of course, if the syntax becomes used by many, then I might consider polishing it to near perfection and publishing it via Package Control, but that’s still a long way off: the language is quite rich, and there is still a lot to cover.

Right now I just need a break from it all; I’ve spent so many hours getting it straight, while everybody here in Italy is outside enjoying the first sunny days of summer (everybody except me, apparently). I think I’m going to call it a day and take the dog for a nice long walk. :slight_smile:

0 Likes

#10

The good news is that handling odd whitespace is much easier with sublime-syntax than with other systems. In other syntax highlighters, complex constructs require complex regexps. In Sublime, you generally want to avoid complex regexps by matching only one token at a time. The captures rule should be used sparingly, and you should only rarely need to explicitly account for whitespace. Any time you might try to look at two tokens with a single regexp, you can probably avoid the edge cases entirely by using several simpler rules instead.

The exceptions to this are generally situations that are provably impossible to handle in a deterministic context-free parser. JavaScript has a couple of these. I suspect that this language does not.

In this case, there are two options. The first is to allow the meta.class scope to continue until the next construct if the dot is omitted. This is probably what I’d do. It’s the simplest solution, and I don’t see any downside. The second option is to guess that END EVERY\n won’t be followed by foo or a dot, and have a way to recover from a wrong guess there. This is a little tricky, but doable; it comes up a couple of times in the JavaScript syntax. I doubt it’s worth the effort here.
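
To make the first option concrete, here is a minimal sketch of the one-token-per-rule approach, assuming the context names used earlier in the thread (else-pop, immediately-pop, meta-class, end-every-keyword). The identifier variable is a placeholder pattern and the other context names are illustrative; none of this is the real Alan syntax, and matching of the class name and body is left out.

```yaml
# Fragment of a .sublime-syntax file; a sketch, not the real Alan syntax.
variables:
  # Assumed identifier pattern, for illustration only. A real syntax would
  # also have to keep reserved words such as EVERY out of it.
  identifier: '\b[A-Za-z][A-Za-z0-9_]*\b'

contexts:
  main:
    - match: (?i)\bEVERY\b
      scope: storage.type.class.alan
      push:
        - meta-class   # bottom: holds the meta scope
        - class-body   # top: active right after EVERY

  else-pop:
    - match: (?=\S)
      pop: true

  immediately-pop:
    - match: ''
      pop: true

  # Pops as soon as every context above it has popped.
  meta-class:
    - meta_scope: meta.class.alan
    - include: immediately-pop

  class-body:
    # One token per rule: END here, EVERY in its own context.
    - match: (?i)\bEND\b
      scope: keyword.control.alan
      set:
        - optional-terminator       # tried last
        - optional-tail-identifier  # tried second
        - end-every-keyword         # top of the stack, tried first

  end-every-keyword:
    - match: (?i)\bEVERY\b
      scope: keyword.control.alan
      pop: true
    - include: else-pop

  optional-tail-identifier:
    - match: '{{identifier}}'
      scope: entity.name.class.tail.alan
      pop: true
    - include: else-pop

  optional-terminator:
    - match: \.
      scope: punctuation.terminator.alan
      pop: true
    - include: else-pop
```

Because the optional contexts only pop on a non-whitespace lookahead, no rule here mentions spaces or newlines. The intent is that the identifier and the dot can sit on later lines, and that meta.class.alan simply runs on to the next token when they are omitted.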

0 Likes

#11

@kingkeith

please consider contributing your tips to …

Done (and thanks for the link, didn’t know about it):

it’s for this reason that I’ve been using https://github.com/trishume/syntect, specifically the syntest example with the --debug argument.

I’m looking at it, and it does seem a very good fallback solution. My only question is, even though it uses the same syntax format as ST, how can we know if its parser works 100% the same? I.e., some edge cases might be handled differently, especially if there is a bug in one or the other.

True, if it handles most syntaxes well so far, we should expect it to be compliant. But since ST is closed source there is no real way to know except by stumbling upon different behaviours of the two. If I understand correctly, ST is inspired by TextMate in its use of grammars, but the sublime-syntax file format was created by ST (and, for example, uses a different RegEx engine).

How has your experience with it been so far? Does it support all syntaxes with a 100% match?

0 Likes