
Understanding the Syntax Parsing

#1

So, part of what I’ve been working on is a code beautifier that, more or less, aligns and indents code properly by scanning through the source document. That means there’s a lot of scanning for keywords, parentheses, and things of that nature going on.

It hasn’t escaped my notice that this is to some degree exactly what the syntax file is doing. While it’s been instructive to roll my own, it might be ultimately more efficient to lean on the editor’s own syntax context and scope rules. With that in mind, I’ve been studying the syntax file definition documentation and I’ve come up with a few questions.

  1. Am I correct in believing that the scope points are linked to the View object? That is to say, if I subsequently slurp the entire source code file into a string, I’ve then lost all the scope information? I think that must be true, but it’d be interesting if it weren’t.

  2. Assuming 1 to be true, is there an API command to run the syntax scope analysis against a string?

In a different vein, the syntax file I have was created for TextMate a long time ago. It still works well, though I think perhaps some good could be done by getting more fine-grained contexts, so I’ve been mulling over writing my own variation.

  1. The language I’m working in (VHDL) does not really care about whitespace, aside from requiring at least some whitespace as a delimiter. It’s also kind of verbose (it’s related to Ada in that way). For example, there is a top-level lexical object called an entity (not to be confused with the entity scope name). The syntax goes a little like this:

    entity <valid_name> is

    end entity <valid_name>;

However, as far as the language goes, you could write it as:

entity
<valid_name>
is
end
entity
<valid_name>
;

Now, no one sane writes it that way, but even so, I’ve run across a lot of variations in what people actually write. Am I correct in believing the syntax engine would have a lot of trouble matching the extreme variation? If I’m reading the documentation correctly, it says that matches are only ever applied against a single line. So, coping with this variation would require a lot of pushing and popping? Or is there a way to have a match apply over multiple lines?

This, by the way, is actually one of the errors in the current VHDL syntax file, I believe, though I’ve not figured out exactly how to correct it yet. A procedure body definition might look like this:

procedure <valid_name> [(
    <parameter block> )] is
begin
    <statements>;
end procedure <valid_name>;

It is completely valid to start the parenthetical parameter block on the next line; however, if I do that, the captured name is incorrect and the final line gets flagged as an error.

  2. Prioritization. The language has a few constructs that can take additional words in different contexts. So in an entity declaration you may have a generic ( ); block, while in an instantiation you may have a generic map ( ) block. If I’m trying to write a regex for this, will I need to look for generic map before generic and thus put it higher in the file? (See the quick regex sketch after this list.) There may also be some tricks with contexts I can use, as that map element is only ever used in the one place. However, I also have a few odd things like branching constructs: an assert statement can have a report clause within it along with a severity clause, yet it’s entirely valid to have a report statement on its own line, completely separate from assert.
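
A quick way to convince myself of the ordering behaviour: as far as I understand it, when two rules in the same context would match at the same position, the one defined first in the file wins, so the generic map rule needs to sit above the plain generic rule (or the plain rule needs a lookahead). The same left-to-right principle can be demonstrated with an ordinary regex, so here is a small Python illustration (the VHDL-ish fragment is made up):

import re

line = "generic map ( DATA_WIDTH => 8 )"   # made-up fragment

# Alternation is tried left to right, so the longer phrase has to come
# first or it never wins at the same start position.
print(re.match(r"generic|generic map", line).group())   # -> generic
print(re.match(r"generic map|generic", line).group())   # -> generic map

# Alternatively, a negative lookahead lets a plain 'generic' rule refuse to
# fire when 'map' follows, regardless of where the rule sits in the file.
plain_generic = re.compile(r"\bgeneric\b(?!\s+map)")
print(plain_generic.match(line))                         # -> None
print(plain_generic.match("generic ( N : integer );"))   # -> a match object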

Anyhow, any insight into the nitty-gritty of this would be greatly appreciated. I’m going to continue to immerse myself in the basic syntax documentation, the unofficial syntax documentation, and the scope naming guidelines (which are really pretty complex, and I can’t say I’ve wholly wrapped my head around exactly which names should be applied in my situation). Thanks in advance!


#2

I have been experimenting with this kind of verbose language a lot lately, and here is one way to do it. Let’s say we want to perfectly balance the following construct:

entity <valid_name> is

end entity <valid_name>;

no matter where these keywords reside in the buffer. Maybe they’ll span multiple lines; we don’t know :) The following toy syntax perfectly scopes the whole shebang, including giving the body a meta.entity.body scope. I have commented a lot of the lines, so I hope it’s clear what is happening.

%YAML 1.2
---
scope: source.test

variables:
  # Reusable variables that we may use in regular expressions.
  identifier: '[[:alpha:]_][[:alnum:]_]*'

contexts:

  # We start out here in the "main" context. It has one match rule. The "syntax"
  # stack contains only one element, namely this "main" context.
  main:
    # Look for the word "entity"
    - match: \bentity\b
      # If we have a match, scope it.
      scope: keyword.other.entity
      # And push THREE contexts on the "syntax" stack.
      # The stack will look like this: (from top to bottom)
      #  expect-entity-identifier
      #  expect-entity-is
      #  entity-body
      #  main
      push: [entity-body, expect-entity-is, expect-entity-identifier]

  expect-entity-identifier:
    # Look for any identifier. If we found it, pop this context off the stack.
    - match: \b{{identifier}}\b
      scope: entity.name.tag
      pop: true
    # If instead we find whitespace, just eat it. It's not an error.
    - match: \s+
    - match: \n
    # If instead we do not find an identifier nor whitespace, we can scope it
    # as an error, since we expect to find an identifier.
    - match: .+
      scope: invalid.illegal

  expect-entity-is:
    # Look for the word "is". If we found it, pop this context off the stack.
    - match: \bis\b
      scope: keyword.other.entity
      pop: true
    - match: \s+ # eat whitespace
    - match: \n
    - match: .+ # anything else is an error
      scope: invalid.illegal

  # When we arrive here, the stack should look like this:
  #  entity-body
  #  main
  entity-body:
    - meta_scope: meta.entity.body
    - match: \bend\b
      scope: keyword.control.end
      # Instead of a "push", we use a "set". A "set" is just a "pop-and-push".
      # So we first pop off the entity-body context, and then push these three
      # contexts on the stack. The stack should then look like this:
      #  expect-entity-keyword
      #  expect-entity-identifier
      #  expect-entity-terminator
      #  main
      # Note that we re-use the expect-entity-identifier context, but we use it
      # in a different context now (pun intended).
      set: [expect-entity-terminator, expect-entity-identifier, expect-entity-keyword]

  expect-entity-keyword:
    # Pop when we find the word "entity"
    - match: \bentity\b
      scope: keyword.other.entity
      pop: true

  expect-entity-terminator:
    # When we are here the stack should look like this:
    #  expect-entity-terminator
    #  main
    - meta_content_scope: meta.entity.body
    # Once we find a semicolon we pop and we are back in the main context, where
    # we continue as normal.
    - match: ;
      scope: meta.entity.body punctuation.terminator
      pop: true

Unfortunately, this technique does not allow you to “match” the entity’s names with each other (maybe it does with some elaborate construct; I haven’t put much thought into it). It’s possible to re-use capture groups for matches, but only in a limited way. Perhaps matching the names/identifiers could be the job of a plugin. In any case, the result should look like this: [screenshot of the fully scoped entity construct]
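
If the name check does end up being a plugin’s job, here is a rough sketch of what that might look like, assuming the scope names from the toy syntax above (meta.entity.body and entity.name.tag) and assuming both identifiers land inside the meta.entity.body region; the command name is made up and untested:

import sublime
import sublime_plugin


class CheckEntityNamesCommand(sublime_plugin.TextCommand):
    """Warn when the name after 'entity' and the name after 'end entity' differ.

    Relies entirely on the scopes assigned by the toy syntax above.
    """

    def run(self, edit):
        bodies = self.view.find_by_selector("meta.entity.body")
        names = self.view.find_by_selector("entity.name.tag")

        for body in bodies:
            # Identifiers that fall inside this particular entity construct.
            inside = [self.view.substr(n) for n in names if body.contains(n)]
            if len(inside) >= 2 and inside[0] != inside[-1]:
                sublime.status_message(
                    "entity '{}' is closed as '{}'".format(inside[0], inside[-1]))

You could try it from the console with view.run_command("check_entity_names").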


#3

Yowza. Thanks. I think I was correct in suspecting that this was going to be a tough nut to crack. You also touched on another question I forgot to ask: whether capture groups can span constructs after you’ve closed them out (it seems the answer is ‘no’).

I’ll need to study this and decide whether I want to handle the language perfectly, or make some compromise toward how most engineers would actually write it. The example above was the simplest construct. If I were doing a more ‘normal’ variation, it’d look like:

entity some_entity is
    generic 
    (
        DATA_WIDTH : integer := 8
    );
    port
    (
        clk : in std_logic;
        reset : in std_logic;
        d : in std_logic_vector(DATA_WIDTH-1 downto 0);
        q : out std_logic_vector(DATA_WIDTH-1 downto 0)
    );
end entity some_entity;

And of course, all that stuff in the middle is optional, which adds its own brand of weirdness to matching it. When writing my indenter, I was basically looking for important keywords (like ‘entity’ here), pushing the expected end construct onto a stack, rescanning the line, and then scanning each line after that until I found it, while also handling the other keywords and unbalanced parentheses (roughly like the sketch below). It works, but it’s not unlike the syntax engine’s context push and pop operations.
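
For what it’s worth, the scan I’m describing is roughly shaped like this. The keyword table is just a placeholder (a real VHDL indenter needs far more keywords, real tokenising, and comment/string handling), but it shows the stack idea:

# Placeholder opener -> expected-closer table; '(' nests like a keyword does.
OPENERS = {"entity": "end", "architecture": "end", "process": "end", "(": ")"}


def indent_levels(lines, shift=4):
    """Return an indent column for each line based on still-open constructs."""
    stack = []                        # closers we are still waiting for
    levels = []
    for line in lines:
        tokens = line.replace("(", " ( ").replace(")", " ) ").lower().split()
        # A line that starts by closing the innermost construct outdents itself.
        level = len(stack)
        if tokens and stack and tokens[0] == stack[-1]:
            level -= 1
        levels.append(level * shift)
        for tok in tokens:
            if stack and tok == stack[-1]:
                stack.pop()           # found the closer we were waiting for
                if tok == "end":
                    break             # crude: skip trailers like 'end entity foo;'
            elif tok in OPENERS:
                stack.append(OPENERS[tok])
    return levels

Run over the entity example above, it should reproduce the 0/4/8 indent columns shown there.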

And the more elaborate scope rules might let me do a sort of large, project-wide outliner where I can find where things are declared, their bodies, and then where they are instantiated (read: used).
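
(As a starting point for that, the project-wide symbol index is already queryable from a plugin or the console; this is just a sketch using lookup_symbol_in_index, where what counts as a declaration depends on the symbol list the syntax defines, and some_entity is simply the name from the example above.)

import sublime

# Ask the project index where a given name is defined.
window = sublime.active_window()
for path, display_path, (row, col) in window.lookup_symbol_in_index("some_entity"):
    print("{}:{}:{}".format(display_path, row, col))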

But as always… baby steps :). Thanks for the example!


#4

Yes, once you extract the contents of the buffer (or part of it) out to a string, you just have a regular string. I would guess that the scope information is actually linked to the buffer that the View represents and not the view itself (as you can have cloned views), but I don’t know for sure, and that’s a nitpick at best since it operates the same either way.

As far as I’m aware, the only way to get a syntax applied is from within a buffer/view, so to get an analysis of a regular string you’d likely need to create a temporary view and insert the data into it.
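
If you go the temporary-view route, something along these lines might work; this is only a sketch (the function name and selector are arbitrary, and the syntax path is just the shipped Python one as an example):

def scopes_for_string(window, text, syntax_path, selector):
    """Sketch: apply a syntax to an arbitrary string via a throwaway view."""
    view = window.new_file()
    view.set_scratch(True)                 # no save prompt when closing
    view.set_syntax_file(syntax_path)      # e.g. "Packages/Python/Python.sublime-syntax"
    view.run_command("append", {"characters": text})
    # Depending on build and buffer size the parse may lag a little; in
    # practice you may need a sublime.set_timeout() delay before querying.
    regions = view.find_by_selector(selector)
    results = [(r.a, r.b, view.substr(r)) for r in regions]
    view.close()                           # discard the scratch buffer
    return results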

Note, however, that the information the API returns when you query scopes in a view translates directly to offsets into the string that represents the buffer (either singly as a point or as a begin/end pair in a Region object). So, presuming the view you extracted the string from is still open, you can use it to find points of interest and then get at the appropriate parts of the string.

For example, to list all of the classes in the current Python file, both of these are equivalent:

>>> class_list = view.find_by_selector("entity.name.class.python")
>>> [view.substr(loc) for loc in class_list]
['PackageFileSet', 'OverrideDiffResult', 'PackageInfo', 'PackageList']
>>> class_list = view.find_by_selector("entity.name.class.python")
>>> view_content = view.substr(sublime.Region(0, view.size()))
>>> [view_content[loc.a:loc.b] for loc in class_list]
['PackageFileSet', 'OverrideDiffResult', 'PackageInfo', 'PackageList']

That said, changes to the string (e.g. changing the code alignment) will stop the locations from lining up. In that case it may be better to just make the modifications directly to the buffer, rather than extracting its contents first, if possible.
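
If you do work on the buffer directly, one pattern that keeps the offsets honest is to do the replacements inside a TextCommand and walk the regions from back to front, so the regions you haven’t touched yet keep their positions. A rough sketch, with a placeholder transformation:

import sublime_plugin


class TidyRegionsCommand(sublime_plugin.TextCommand):
    """Sketch: rewrite scoped regions in place, last region first, so the
    offsets of regions not yet processed stay valid as the buffer changes."""

    def run(self, edit, selector="entity.name.class.python"):
        for region in reversed(self.view.find_by_selector(selector)):
            text = self.view.substr(region)
            # Placeholder transformation; a beautifier would realign here.
            self.view.replace(edit, region, text.upper())

From the console, view.run_command("tidy_regions") would then rewrite every matched region in a single undo step.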


#5

Thanks. It’s as I expected. And that comment at the end is exactly why I was wondering whether I could reapply the scope analysis: if I’m playing around with the alignments, that’s necessarily going to change the scopes at those positions, and then I wouldn’t have a 1-to-1 representation anymore. However, I imagine Sublime has to have a method it calls to create the scope index; it could just be too tightly bound to the View to make it readily callable for a string.

For instance, one thing that I’ve managed to do already is create a command where you put the point anywhere in that entity construct, and it’ll parse out the elements of the interface. This is useful in that language because there’s sometimes a lot of copying that dumb thing around from one place to another. I did this by grabbing the point location, scanning upward until I find the start of the lexical structure, then scanning downward to the end of the enclosing structure, and then extracting the middle, basically creating my own region based on the point and the language (roughly like the sketch below).
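
Purely to illustrate the shape of that scan, here’s roughly what it looks like; the class name is made up, and the regexes only know about the entity construct, ignoring all the real-world complications (comments, strings, nesting):

import re
import sublime
import sublime_plugin

BEGIN = re.compile(r"^\s*entity\b", re.IGNORECASE)
END = re.compile(r"^\s*end\b", re.IGNORECASE)


class ExtractInterfaceCommand(sublime_plugin.TextCommand):
    """Sketch: build a Region covering the entity construct around the caret."""

    def run(self, edit):
        view = self.view
        row, _ = view.rowcol(view.sel()[0].begin())

        # Scan upward for the line that opens the construct...
        start_row = row
        while start_row > 0 and not BEGIN.match(self.line_text(start_row)):
            start_row -= 1
        # ...and downward for the line that closes it.
        end_row = row
        last_row, _ = view.rowcol(view.size())
        while end_row < last_row and not END.match(self.line_text(end_row)):
            end_row += 1

        construct = sublime.Region(
            view.text_point(start_row, 0),
            view.line(view.text_point(end_row, 0)).end())
        print(view.substr(construct))    # the interface gets parsed out of this

    def line_text(self, row):
        return self.view.substr(self.view.line(self.view.text_point(row, 0)))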

(Note: I have a sneaking suspicion that this is possible with the API commands that search based on a ‘key’, but when I started I had no idea what key it was referring to, as it’s not defined on that particular page; it’s very likely the syntactical key, though.)

Anyway, plenty to process, and I’ve got a week here where I can bang on it some more. The amazing thing is how much goes into all this, and there’s that vhdl-mode that has existed in Emacs for probably a good 20-30 years. I will say it’s a lot more interesting learning Python for this than trying to learn Lisp :-).


#6

I would like to recommend reading the whole thread of sublimehq/Packages#115, which might help with understanding the newer features of *.sublime-syntax files, should you plan to develop that syntax further.
