Sublime Forum

Is it possible to create syntax definitions that can refer up the context stack?

#1

I’m starting to learn Sublime Text 3’s YAML syntax rules. So far, it seems pretty powerful and slick. I use the Ada language and I have never found a good syntax coloring system for Ada or languages like Ada. I am hopeful that Sublime Text 3’s system can do a better job.

Ada does not use { and } to denote blocks - instead there are keywords. Also, like most languages, white-space (including new-lines) can be compressed to a single space between identifiers. Hence, an Ada program can be written on a single line or with each parser token on its own line.

Here is an example:

procedure FOO is
    <var-defs>
begin
    <statements>
end FOO;

procedure
FOO
is
<var-defs>
begin
<statements>
end
FOO
;

Firstly, because Sublime Text matches patterns against one line at a time, I realize that each matching rule may contain only a single parser token. If a rule spans more than that, the syntax styler will only match when the code is laid out the way the styler expects, which is not a language requirement.

Now, if I want to enforce syntax rules like “you can’t have the word procedure twice in a row” or “you can’t have two identifiers naming the procedure before the ‘is’ keyword”, is it correct to say that I have to put each parser token of a syntax progression in its own context?

In Ada, you’ve reached the end of a code block when you encounter the ‘end something;’ statement.

loop
end loop;

if x then
elsif y then
else
end if;

and nested:

if x then
    loop
    end loop;
end if;

What I think I need to do is match the ‘if’, ‘loop’, and ‘FOO’ from the start of the block with the ‘end IF’, ‘end LOOP’, and ‘end FOO’ respectively at the end of the block. I’ve seen the trick in the documentation of carrying captures forward to the next context, but can this be done to a context much further down the scope tree with named captures?

My other question is that if I have multiple matches in a context, can they be ordered according to the syntax production rules and once a match is matched then it cannot be used again for the current scope level? For example, if I write (ignoring the recursive content):

contexts:
  procedure-start:
      - match: \bprocedure\b
        scope: procedure
        push: proc-def

  proc-def:
      - match: '\b(?<procname>{{identifier}})\b'
      - match: \bis\b
      - match: \bbegin\b
      - match: \bend\b
      - match: '\b\k<procname>\b'
      - match: ';'
        pop: true

Once I’m in the proc-def context, any of these rules can match in any order. Right?

This means the following text will highlight as if it were a legal procedure definition:

procedure BLAH FIBBLE is is
begin is end begin ;

This is clearly not legal syntax, and I hope to flag it as such.

What I hope I can do, to avoid having to put each token into its own context, is to have these matched in order (or some specified order) and, once matched, be taken out of eligibility for the current scope (perhaps optionally by YAML rule) so they will not be matched again. If an ineligible rule is matched, then a syntax error is identified. Can something like this be done?

Also notice the named capture. Is that possible? I can’t seem to get it to work.

I have high hopes that this new language definition syntax will be better than anything before, but the learning curve is steep.

Thank you to any who can lend their experience.

0 Likes

#2

Ideally what I’d like to do is take the BNF production rules for a language and convert them into a YAML syntax definition file. That would create scope names that were identical to and relatable to the documented BNF names in the Language documentation.

0 Likes

#3

Yes, this is correct: to enforce the order of tokens, each expected token needs its own context. The context stack is an abstraction of your state machine and will be used to track where you currently are within the syntax.

For your situation, you can basically choose between two options. Both have advantages and disadvantages, and both revolve around using a specific context for each position you are in with regard to the expected token(s).

The first option is to create a context that matches the expected token (e.g. then) and pops the current context at the same time. For this, you would have a match for keywords denoting compound statements, like if, which would then push multiple contexts on the stack at the same time. An example for this can be found in the docs.
The trick here is that the first match, the if keyword, announces all of the following syntax tokens at once, while the individual contexts pop themselves off the stack once they have matched what they wanted.
For simplicity, you may combine a few contexts in case they only occur together in certain combinations, but there is no rule of thumb here. Generally aim for reusability.
Invalid tokens are not too easy to handle, because you potentially have to bail out of multiple contexts at the same time to reset your state machine to a usable state, and you cannot pop off multiple contexts at once. Each context needs to pop itself (using look-ahead matches, for example).

The other option is to have very specialized contexts for specific situations and to change the stack only by switching between them (i.e. using set). This approach is simpler because you only have one thing to keep track of in your mind (the context’s name, not the entire stack), but on the other hand you’ll end up with many more specialized contexts, more repetition and boilerplate, and you don’t make use of the power of a stack. It is much easier to bail out of these specialized contexts because you can easily reset the state to something usable once you encounter an invalid token.
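As a rough illustration of this second option, here is a minimal sketch applied to the procedure example from the first post. The context names, scope names and the identifier pattern are all assumptions made for the sketch, and keywords are matched case-sensitively for brevity even though Ada is case-insensitive. Each context expects exactly one token and uses set to hand over to the next, so the layout of the source across lines does not matter: the stack persists from one line to the next.

```yaml
%YAML 1.2
---
# Hypothetical sketch: names, scopes and the identifier pattern are assumptions.
name: Ada (set-chain sketch)
scope: source.ada
variables:
  identifier: '[A-Za-z][A-Za-z0-9_]*'

contexts:
  main:
    - match: \bprocedure\b
      scope: storage.type.procedure.ada
      push: procedure-expect-name

  procedure-expect-name:
    # A keyword where the name belongs is an error; pop to reset the state.
    - match: \b(procedure|is|begin|end)\b
      scope: invalid.illegal.expected-name.ada
      pop: true
    - match: '\b{{identifier}}\b'
      scope: entity.name.function.ada
      set: procedure-expect-is

  procedure-expect-is:
    - match: \bis\b
      scope: keyword.other.ada
      set: procedure-declarations
    # Anything else (e.g. a second identifier) is flagged and the chain abandoned.
    - match: \S+
      scope: invalid.illegal.expected-is.ada
      pop: true

  procedure-declarations:
    - match: \bbegin\b
      scope: keyword.control.ada
      set: procedure-statements
    # declaration rules would go here

  procedure-statements:
    - match: \bend\b
      scope: keyword.control.ada
      set: procedure-end
    # statement rules would go here

  procedure-end:
    - match: '\b{{identifier}}\b'
      scope: variable.function.ada
    - match: ';'
      scope: punctuation.terminator.ada
      pop: true
```

With this sketch, the text "procedure BLAH FIBBLE is is" scopes BLAH as the name, flags FIBBLE as invalid and returns to main, instead of accepting the tokens in any order.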

I have included two examples of the first option, in which I simplify contexts to a varying degree. I generally dislike having redundancy in code and in syntax definitions, which is why I generally prefer the first option and try to work around invalid token matching with smart patterns.

  statements:
    - include: if-statement
    # more statements ...
    # match still unmatched keywords in case they occur verbatim (failsafe approach)
    - match: \b(end|else|elsif)\b 
      scope: keyword.control.flow.ada
    
  expression:
    - include: function-calls
    - include: identifiers # or "variable"
    - include: operators

  if-statement:
    - match: \bif\b
      scope: keyword.control.flow.if.ada
      push: [if-body, if-conditional]

  if-conditional:
    - match: \bthen\b
      scope: keyword.control.flow.then.ada
      pop: true
    - include: expression
    
  if-body:
    - meta_scope: meta.conditional.ada
    - match: \belsif\b
      scope: keyword.control.flow.elsif.ada
      push: if-conditional
    - match: \belse\b
      scope: keyword.control.flow.else.ada
    - match: \bend if\b
      scope: keyword.control.flow.end-if.ada
      pop: true
    - include: statements

###############

  statements:
    - include: if-statement
    # more statements ...
    # match any unmatched keywords as illegal, since they are probably not valid in this context
    - match: \b(end|else|elsif)\b 
      scope: invalid.illegal.unexpected-keyword.ada
    
  expect-expression:
    # I'll simplify this and just pop before various keywords
    - match: (?=\b(if|then|elsif|end)\b)  # could be stored as {{keywords}} variable
      pop: true
    # ...

  if-statement:
    - match: \bif\b
      scope: keyword.control.flow.if.ada
      push: [if-body, expect-then, expect-expression]
      
  expect-then:
    - match: \bthen\b
      scope: keyword.control.flow.then.ada
      pop: true
    - match: \S
      scope: invalid.illegal.expected-then.ada
      
  if-body:
    - meta_scope: meta.conditional.ada
    - match: \belsif\b
      scope: keyword.control.flow.elsif.ada
      # Note that I'm setting if-body here, although it's redundant, but it expresses the intent better
      set: [if-body, expect-then, expect-expression]
    - match: \belse\b
      scope: keyword.control.flow.else.ada
      # I'm replacing the if-body context here to disallow further "elsif" and "else" occurrences
      set: [else-body]
    - include: else-body
    
  else-body:
    - meta_scope: meta.conditional.ada
    - match: \bend if\b
      scope: keyword.control.flow.end-if.ada
      pop: true
    - include: statements

This is correct: once you are in the proc-def context, any of its rules can match in any order.

Named captures are not recognized. You must use positional captures. Fortunately, the PackageDev package assists you with this.
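To make the positional form concrete, here is a minimal sketch based on the heredoc technique from the documentation, where a pattern with pop in the pushed context may use a numbered back-reference to a capture group of the match that pushed it. The context names, scope names and identifier pattern are assumptions. Note the limits: the keyword and the name must sit on the same line for the pushing match to see both, the back-reference compares text exactly (so FOO and Foo would not match, although Ada treats them as equal), and the capture is available only in the context pushed by that match, not further down the stack.

```yaml
%YAML 1.2
---
# Hypothetical sketch: names, scopes and the identifier pattern are assumptions.
name: Ada (back-reference sketch)
scope: source.ada
variables:
  identifier: '[A-Za-z][A-Za-z0-9_]*'

contexts:
  main:
    # Group 1 is the keyword, group 2 the procedure name.
    - match: '\b(procedure)\s+({{identifier}})\b'
      captures:
        1: storage.type.procedure.ada
        2: entity.name.function.ada
      push: procedure-body

  procedure-body:
    - meta_scope: meta.function.ada
    # \2 refers to group 2 of the match that pushed this context.
    - match: '\b(end)\s+(\2)\s*(;)'
      captures:
        1: keyword.control.ada
        2: variable.function.ada
        3: punctuation.terminator.ada
      pop: true
    # Any other name after "end" (except nested block keywords) is a mismatch.
    - match: '\bend\s+(?!(?:if|loop|case)\b){{identifier}}\s*;'
      scope: invalid.illegal.mismatched-end.ada
      pop: true
```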

2 Likes

#4

Wow. Thanks.

I can’t seem to find it now, but I thought I saw a reference to pop that took a number as a parameter instead of true. If that does indeed exist, and assuming I knew how many contexts to pop, this may be a possibility. Unfortunately, I think that look-ahead matches will never work in free-form languages when only one line of source material is used for the match, unless assumptions are made about how the source material will be formatted.

The second approach would be much more difficult to maintain. I was really hoping for something where I could map each BNF production to a single context. When the language evolved, it would be much easier to maintain the YAML because you could quickly and easily identify where the change needed to be. However, I think additional features are needed in the YAML definitions for this to be possible. In addition to scope and named context, state information within each context would need to be maintained.

Thank you for your examples. I’ll have to work through some examples myself using both methods to try to decide which approach is best. This is starting to feel like a really big project.

0 Likes

#5

A numeric argument to pop must have been a feature request, as this isn’t possible currently.

You could probably generate a syntax definition based entirely on a BNF, but I don’t think a tool for this exists currently. You’d have to have a context for each state in the grammar (including multiple scopes for one BNF rule, depending on complexity) and you’d need to assign scope names.
Either way, while fundamentally not too different from BNF rules, contexts in syntax definitions are much more fine-grained, scope naming aside, and flexible. And for syntax definitions you need to consider a user that is constantly changing a file and might produce syntax errors temporarily, for which you ideally do not highlight the rest of the file as invalid.

Here’s a commit where I added something similar to the Python syntax (followed up by another PR):

This only works properly if you pay attention to the number of contexts pushed on the stack, which works well for Python since meta blocks aren’t tracked (that doesn’t make sense in languages with significant whitespace) and a healthy mix of the two options I provided earlier is used. I don’t think I can explain this better right now since it’s really late over here, but I suppose this will give you a good start.

0 Likes

#6

It sounds like this will actually be a lot of hours of planning and work for very little gain because of some simple limitations. The only reason to spend the time would be to get more than just colored keywords and numbers, which sounds like the best this can easily do for Ada. You mentioned a theoretical tool that could take BNF and generate the YAML rules for me. That may be the better (and more fun) project. There would then be no maintenance, only getting the new BNF after each language revision and running the YAML generator again.

0 Likes

#7

I thought at one point about generating syntax definitions directly from BNF, but there are a couple of points of difficulty.

First, the produced syntax would be very verbose. The ECMAScript package is very faithful to the language spec, but it’s over five thousand lines. Editing such a syntax by hand would be difficult, to say the least.

Second, it might be overly strict. Syntax highlighting needs to work gracefully for invalid code, such as when you start typing in the middle of a file. It seems unlikely that a syntax programmatically derived from BNF would do this well.

I do find a formal syntax to be an ideal starting point for writing a Sublime syntax definition, and I wish that more languages had them.

2 Likes

#8

Yeah, I had put in a feature request for saved captures and multi-pop. The multi-pop didn’t turn out to be a very difficult thing to work around, because one can push the initial context and then set the second one. However, as noted, there’s no way to do illegal end clause identifier matching this way. It’s a choice: either you subscope within the same structure (declaration/body/etc.) and gain the ability to set rules for what structures can be in those sections, or you create a simple context/scope for the entire structure and gain the ability to catch the end clause identifier.
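A minimal sketch of that push-then-set workaround, using an Ada declare block as the structure (all context and scope names here are assumptions). Because the second section replaces the first on the stack rather than sitting on top of it, the structure never occupies more than one stack entry, and a single pop at the end leaves it completely:

```yaml
%YAML 1.2
---
# Hypothetical sketch: context and scope names are assumptions.
name: Ada (push-then-set sketch)
scope: source.ada

contexts:
  main:
    - match: \bdeclare\b
      scope: keyword.control.ada
      # One stack entry represents the whole block.
      push: block-declarations

  block-declarations:
    - meta_scope: meta.block.declarations.ada
    - match: \bbegin\b
      scope: keyword.control.ada
      # Replace the entry rather than stacking a second one.
      set: block-body

  block-body:
    - meta_scope: meta.block.body.ada
    - match: \bend\b
      scope: keyword.control.ada
      # A single pop now leaves the entire structure.
      pop: true
```

Nested statements such as if or loop would need their own pushed contexts inside block-body so that their "end if" and "end loop" are consumed before the bare end rule sees them.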

0 Likes

#9

Just as a note, because VHDL was based on Ada: instead of looking for every possible combination of code, I found it very useful to group classes of structures. Again, this is a VHDL example, but hopefully relevant.

We have a large code structure called the architecture and it has a fairly basic declaration and body structure. In the body we have code structures that are concurrent. One of these structures, the process, has structures interior to it that are sequential. For example:

architecture rtl of foobar is
    <declaration block>
begin
    <concurrent block>
    MY_PROCESS : process 
    begin
        <sequential statements>
    end process MY_PROCESS;
end architecture rtl;

So, what I did was create structures for each of my sequential statements. This is actually pretty straightforward. You usually have if conditionals, assignments, and loops. Using include, I put these into a context called sequential-statements. I did the same for structures that are concurrent. These are in a context called concurrent-statements. Some of the concurrent statements, like process, include the sequential-statements context in their body section.
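The grouping described above might look roughly like the following sketch. Only the context names sequential-statements and concurrent-statements come from the description; every other context name and all scope names are assumptions, only one statement of each class is filled in, and keywords are matched case-sensitively for brevity.

```yaml
%YAML 1.2
---
# Hypothetical sketch: apart from the two group names, all names are assumptions.
name: VHDL (grouping sketch)
scope: source.vhdl

contexts:
  main:
    - include: concurrent-statements

  # Group of structures that may appear in a concurrent block.
  concurrent-statements:
    - include: process-statement
    # other concurrent structures would be included here

  # Group of structures that may appear in a sequential block.
  sequential-statements:
    - include: if-statement
    # assignments, loops, ... would be included here

  process-statement:
    - match: \bprocess\b
      scope: keyword.declaration.process.vhdl
      push: process-declarations

  process-declarations:
    - match: \bbegin\b
      scope: keyword.control.vhdl
      set: process-body

  process-body:
    - meta_scope: meta.block.process.vhdl
    - match: \bend\s+process\b
      scope: keyword.control.vhdl
      pop: true
    # The body of a process accepts the whole sequential group.
    - include: sequential-statements

  if-statement:
    - match: \bif\b
      scope: keyword.control.conditional.vhdl
      push: if-body

  if-body:
    - match: \bend\s+if\b
      scope: keyword.control.conditional.vhdl
      pop: true
    - include: sequential-statements
```

Adding a new sequential statement then means writing its context once and listing it in sequential-statements; every body that includes the group picks it up.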

For strongly hierarchical languages like VHDL (and presumably Ada) you have an excellent foundation for matching these lexical groups with matching hierarchy inside the definition. It’s difficult to achieve full tokenizing, but you can get really close.

0 Likes