Sublime Forum

Tree-sitter support

#1

Any plan to add support for tree-sitter parsers?

4 Likes

#2

Very unlikely. Sublime’s syntax engine already provides most of the benefits. What Sublime’s parser doesn’t handle are nondeterministic context-free languages, but this support could be added to the current parser, so there is no reason to switch to a new one.

3 Likes

#3

Just wondering if you work for sublime HQ?

I would think there are plenty of things that would be improved. Custom type highlighting, code folding and error recovery for starters. In fact everything mentioned in this talk:

1 Like

#4

I do not work for Sublime HQ. I do, however, know a great deal about Sublime’s syntax highlighter, syntax highlighting in general, and the broader topic of formal language theory. I am well qualified to talk about these things and, more importantly, to explain them. If you have any questions I would be happy to address them.

I have not seen that video, but I have seen other explanations of Atom’s tree-sitter implementation by the same presenter. The GitHub employee does not understand how Sublime’s engine works, and some of the things that he says about it are not true.

… Why is there this inconsistency here? And it’s because none of these tools really understand Go code, or any code for that matter. They’re just kind of looking for simple patterns in the code that they can identify in regexes.

He is describing, more or less, the old TextMate syntax highlighting system (tmLanguage). It’s a fairly simplistic system that doesn’t allow for very sophisticated parsing. Since its introduction, it has become a de facto standard for many text editors, and it is supported by Sublime, Atom, and Visual Studio Code. I would argue that his brief description is not entirely fair to the TextMate system, but on the whole it’s generally accurate.

However, since 2015 Sublime has supported a different, more powerful syntax highlighting system (sublime-syntax). Using this system, you can write syntax definitions that understand any deterministic context-free language. This is a very broad category that includes most programming languages. There are some caveats; for example, a deterministic context-free parser cannot always correctly identify arrow functions in JavaScript, and a context-free parser cannot completely handle some whitespace-sensitive languages.

From a practical perspective, the tree-sitter system is very similar to Sublime’s system. There are very few things that one can do that the other cannot. The exact details depend not only on the algorithms, but on Atom-specific implementation details I can’t comment on, but Atom may be able to exactly identify arrow functions in JavaScript. On the other hand, I suspect that the more formal structure of the tree-sitting algorithm may preclude some of the non-regular regexp tricks that Sublime syntaxes can use. If there are any tree-sitter syntaxes implementing heredocs, I’d be interested to see them.

The GitHub employee in the video shows an example of some Go code highlighted with an old TextMate-derived syntax definition. He thinks that all types should be highlighted the same way in the sample. That’s an opinion. Then, he claims that the reason they aren’t is that “none of these tools really understand Go code”. This is nonsense. Even a TextMate syntax definition could probably have highlighted all of those types the same way. I am absolutely sure that Sublime’s engine can. The reason they’re highlighted differently is that the specific syntax definition he’s using highlights them differently. He specifically claims that this is not an artifact of the specific syntax definition he’s using, but rather a fundamental limitation of the simplistic algorithms used for highlighting in other editors, specifically including Sublime. This is a false claim.

Sublime syntax definitions can be very simplistic or very sophisticated. The current built-in Go syntax definition is fairly simplistic (derived from an old TextMate syntax, I think, and probably from the same one that other editors use). However, there is an open pull request for a brand-new, much more sophisticated syntax definition that takes advantage of the greater power of Sublime’s engine.

Another example of a highly sophisticated Sublime syntax definition is the JavaScript syntax, which parses JavaScript code at a very granular level. You can take a look at that syntax definition and see how many of the contexts are basically just translated from the ECMAScript language spec. The only reason that there isn’t a one-to-one correspondence — that the syntax definition doesn’t do exactly the same work that a JavaScript interpreter would — is that many of the fine details are irrelevant to the task of syntax highlighting, and there’s no need to spend lines of code and CPU cycles doing nonproductive work. On the other hand, it’s as complex as it is because the JavaScript language is unusually hard to parse (e.g. telling whether a slash is a division operator or the beginning of a regular expression).

So on one hand, the GitHub employee is wrong about the sophistication of Sublime’s parser, and on the other hand he’s wrong about the level of sophistication required to do the job.

The tree-sitter system is perfectly fine and there’s nothing wrong with it. It’s arguably more powerful than Sublime’s, depending on what capabilities you prioritize, although the difference is small either way. It does have some disadvantages, though. Most importantly, it’s slow. Sublime’s syntax definitions are YAML files that are compiled to giant state machines and then executed in C code. Atom’s tree-sitter implementation seems to involve one or more intermediate layers of JavaScript code. I’d like to see the benchmarks, but I’m very confident that as efficient as the V8 engine is, it’s not as fast as optimized C code doing substantially less work. The tree-sitter algorithm compensates for this by minimizing reparsing, which is the algorithm’s primary selling point. It may be that once the initial parse is complete, the tree-sitter algorithm is fast enough for practical purposes. In fact, I’d be surprised if it weren’t. But that initial parse is simply going to take longer. Sublime uses different optimizations to minimize reparsing. In principle, they provide weaker guarantees, but in practice they’re a lot simpler and more predictable.

Error recovery is a bit of a non-sequitur. A well-written syntax definition can sensibly handle invalid code. Atom is not special in this regard. Perhaps it’s of particular concern due to the formality of the tree-sitter algorithm. In the video, the GitHub employee describes in some detail how tree-sitter handles syntax errors using nondeterminism. You don’t need nondeterminism to solve that sort of problem. Nondeterminism is slow. (This is a theorem, not an implementation detail; nondeterministic context-free languages cannot be parsed as effectively as deterministic context-free languages.) Even if Sublime implemented nondeterministic parsing, I wouldn’t use it for that. It’s not clear to me whether this is a real problem with tree-sitter, the syntax definition was badly written, or the GitHub employee is just using it as a simple example. From the way he congratulates himself for the originality, probably not the third.

Code folding has barely anything to do with this, and this comment is long enough as it is, so I’ll punt on that.

13 Likes

#5

My understanding is that javascript is just used to define the grammer.
That js file is then used to as the data source to generate c code that compiles to a standalone parser.
So the V8 engine is not needed after it is complied.

It seems tree-sitter would solve, for example, this problem I have:
Language uses a space for concatenation:

string s = "123" "456"

Language can call functions without parenthesis:

functionName "Arg"

So how should this be scoped:

something0 something1

It depends on what “something0” and “something1” are, and with sublime-syntax you can’t tell.

Variables:

string something0 = "123"
string something1 = "456"
something0 something1

Function and Variable:

string something0(string s) { return s }
string something1 = "Arg"
something0 something1

CustomType and Variable:

struct something0
something0 something1

At the moment I solve this by:

  1. Always using parenthesis around arguments with no whitespace before
  2. Always using whitespace before a parenthesized expression.
  3. Using all caps for custom types and nothing else.

That is no use if I’m looking at code I didn’t write that doesn’t follow those conventions.

If there is a way to resolve that with sublime-syntax, I’d be interested to know how.

1 Like

#6

My understanding is that javascript is just used to define the grammer.
That js file is then used to as the data source to generate c code that compiles to a standalone parser.
So the V8 engine is not needed after it is complied.

If that’s the case, I’d still expect it to be slower than Sublime’s engine. JavaScript compiled to C is still going to have some overhead compared to ordinary C.

… It depends on what “something0” and “something1” are, and with sublime-syntax you can’t tell.

Tree-sitter can’t either. That’s not a context-free distinction. You would need a context-sensitive parser to tell what the type of something0 is. Depending on the language, it may even be undecideable. What if something0 could be either a string or a function?

2 Likes

#7

That could be the case, but since it builds a syntax tree, I got the impression it would be able to.
I thought it would use what it had already parsed to determine better what a token represents.
Could have misinterpreted what it does though.

1 Like

#8

A good explanation is Jim Roskind on C ambiguity.

A tree-sitter is basically a GLR parser, and GLR parsers are context-free. Therefore, it’s impossible for tree-sitter to handle this kind of ambiguity. From the link:

Other threads of response included the use of a GLR parser, which tries all alternatives. Alas, as will be shown, the ambiguity in C is too great, and such an approach will not provide a unique syntactically correct parse when it does exist.

…

Since one of the respondees included suggesting that a GLR parser could just try all interpretations, and that only one would be valid, the following is a rather interesting example:

typedef int T;
main() {
int T=100, a=(T)+1;
assert(101 == T);
}


>   Notice that the text "(T)+1" could incorrectly be interpreted, if "T" were still a typedef-name, as a cast to type T of the expression "+1".  Note that the GLR approach would yield two distinct and valid parses, and hence there would be no single winner (without additional heuristics).

Real C parsers handle the ambiguity in the lexical analysis stage. They build up a symbol table as they go along, so they can tell when they encounter a name whether it belongs to a type or a variable.

[Tree-sitter's documentation](http://tree-sitter.github.io/tree-sitter/creating-parsers#lexical-analysis) indicates that the algorithm does separately perform lexical analysis, but as far as I can tell there is no way to write a custom lexer that maintains an internal symbol table and uses that table to distinguish variables from types. In any event, a custom lexer of that sort probably couldn't be compiled in the same way that the grammar is, and even if it could it would mean an extra layer of abstraction between the code and the parser, which wouldn't be great for performance. Plus, I'm skeptical that arbitrary user-defined lexers would be compatible with tree-sitter's reparsing guarantees, but I could be wrong about that.

I suspect that Sublime's engine actually could be augmented to handle C types. Sublime's syntax definitions are essentially descriptions of a pushdown automaton. Executing a PDA is very simple and “dumb”, so you're not likely to run into thorny algorithmic dilemmas when extending its behavior. In the [nondeterministic parsing proposal](https://github.com/SublimeTextIssues/Core/issues/2241), I identify only one place where the proposed behavior might interfere with existing behavior, and it's a performance optimization that could be remedied. The tree-sitter algorithm is much more sophisticated, which makes it potentially brittle. Adding context-sensitive features would likely violate core invariants of the algorithm, and it might not be easy or even possible to remedy those invariants.

For instance, I can't find what regex flavor tree-sitter uses, but I suspect that it doesn't support non-regular features like backreferences. Sublime does — it has its own highly-optimized regex engine that only handles truly regular expressions, but it will fall back to Oniguruma to handle non-regular features. There is a performance penalty, but when used sparingly backreferences add real expressive power to the engine that I'm not sure tree-sitter can match.
4 Likes

#9

If that’s the case, I’d still expect it to be slower than Sublime’s engine. JavaScript compiled to C is still going to have some overhead compared to ordinary C.

Your understanding is wrong.

JavaScript is only used as a DSL for writing grammars. (Alternatively, grammars can be written in plain JSON.) From the grammar a C parser is generated. No JavaScript is ever compiled to C.

The significance of tree-sitter is that it will hopefully become a new cross-editor standard for syntax highlighting. The current situation, with Atom, VSCode, and Sublime each preferring a different syntax highlighting system, is not sustainable.

PS: A Rust rewrite of tree-sitter’s JavaScript tooling is underway.

1 Like

#10

The significance of tree-sitter is that it will hopefully become a new cross-editor standard for syntax highlighting.

Perhaps. I see three chief obstacles to this happening:

  1. The tree-sitter algorithm is much more complicated than a simple PDA-based highlighter despite being fundamentally no more powerful. (As I’ve mentioned, nondeterminism could be added to the sublime-syntax system without significantly changing it.)
  2. The sublime-syntax system supports some non-context-free features that the GLR algorithm cannot ever replicate. The tree-sitter implementation allows developers to accommodate these features by writing custom C extensions, which is inconvenient to say the least.
  3. The current cross-editor standard is tmLanguage. Authoring a sublime-syntax is very similar to authoring a tmLanguage, especially considering that many tmLanguage authors use the YAML representation anyway. Given a tmLanguage, an automated tool can convert it to a sublime-syntax, and the author can add whatever degree of syntactic sophistication is appropriate. The migration path for tree-sitter is unclear: an author must write a brand-new syntax definition using a different paradigm, and the grammar must be complete (if not comprehensive).

If tree-sitter really takes off, and there are high-quality tree-sitter definitions that would be useful to Sublime users, then I expect that someone would probably develop an automated tool to convert (deterministic) tree-sitter grammars to sublime-syntax definitions. I could be misremembering, but I believe that it’s generally easier to convert a grammar to an automaton than vice versa.

The current situation, with Atom, VSCode, and Sublime each preferring a different syntax highlighting system, is not sustainable.

I’m not sure what you mean by this.

3 Likes

#11

As an addendum to the above, I want to clarify that I think that tree-sitter is, in itself, a perfectly good system. I like algorithmic complexity. I also generally prefer declarative systems (e.g. grammars) to procedural ones (e.g. automata). My technical concern about tree-sitter is that it is inflexible. I don’t see how the GLR algorithm could be extended to handle heredocs, whitespace, or other non-context-free features. If there is existing research on this, I would be interested to read it.

My other concern is that while tree-sitter is a fine system for producing high-quality syntax definitions, I’m not convinced that it’s a good system for producing okay-quality syntax definitions. I review packages submitted to the Package Control channel; every week, there are new syntaxes for languages or files I’ve never heard of. Most of these are simple sublime-syntax files written by non-experts solving a single problem. The sublime-syntax system seems to be a good match for this. Time will tell whether tree-sitter can fill the same role.

4 Likes

#12

Not to detract or tangent from the conversation, but this reminds be of a great XKCD comic:

2 Likes

#13

That’s a fun comic, but it doesn’t apply here. There was a de-facto standard (TextMate) which worked fine for a while. Now that it’s dead, a new standard is needed.

0 Likes

#14

The current situation, with Atom, VSCode, and Sublime each preferring a different syntax highlighting system, is not sustainable.

I’m not sure what you mean by this.

Atom is moving to tree-sitter and no longer maintains its TextMate grammars. VSCode is stuck on TextMate grammars, which all the horrors that this entails (>500 issues reported for syntax highlighting of TypeScript alone (!!!), ever more issues filed for TextMate grammars used by VSCode and no longer maintained by Atom). Sublime its betting on its own (from what I understand proprietary) highlighting system. As a language author, I can’t and don’t want to maintain three different grammars.

1 Like

#15

I never understood why Atom and VSCode decided to use TextMate grammars when .sublime-syntax files were already mature before those editors were conceived

1 Like

#16

If the goal is standardization, Atom would have been better off using sublime-syntax rather than making their own system that no one else supported. (I get the impression that the authors of tree-sitter didn’t really understand sublime-syntax and believed that it was basically a thin layer of syntactic sugar over tmLanguage.)

Sublime its betting on its own (from what I understand proprietary) highlighting system.

“Proprietary” in the sense that other editors would have to implement it themselves, but that’s not exactly difficult. A sublime-syntax definition is basically a spec for a DPDA that will parse the code. It’s an incredibly simple system, particularly next to the complexity of tree-sitter. There’s an open-source implementation written in Rust. Heck, I wrote a Node.js implementation once because I was bored, and it’s faithful enough that I discovered several new bugs in the process. Despite this simplicity, it is comparable in power to tree-sitter, extremely fast, and amenable to various avenues of extension.

Myself, I always expected Atom and VSCode to implement sublime-syntax. I was quite surprised when Atom went with its own, incompatible system instead. If VSCode ever implements a more modern system, I would be equally surprised if they implemented tree-sitter over sublime-syntax. (I wouldn’t be surprised if they implemented another new, incompatible format; it’s Microsoft, after all.)

I feel confident in predicting Sublime isn’t going to implement tree-sitter. It’s not inherently better than sublime-syntax, and the one thing that tree-sitter does that sublime-syntax doesn’t — nondeterministic parsing — could be implemented in sublime-syntax anyway. It’s a much more complicated system with more moving parts. It can’t handle some common context-sensitive constructs that sublime-syntax can, and there’s no obvious way to extend tree-sitter itself. The custom C lexers used by many Atom syntaxes probably wouldn’t port, making it hard to reuse existing definitions.

I think it’s far more likely that either someone will write a converter from tree-sitter to sublime-syntax definitions, or Atom will implement sublime-syntax alongside tree-sitter.

2 Likes

#17

Here’s my prediction: VSCode and github.com will follow Atom and support tree-sitter. Nobody will bother to maintain TextMate or sublime-syntax files. Let’s check back in a year!

1 Like

#18

If the problem is ignorance of sublime-syntax, perhaps we should reach out to Atom and VS Code authors, particularly the latter, and enlighten them. I could see them being wary of reverse-engineering this proprietary system; an official statement from the HQ might be helpful there.

2 Likes

#19

I’ve had a conversation with the author of tree-sitter.

I don’t get the impression that Atom decided to improve their highlighting, surveyed the options, and made a conscious decision to go with tree-sitter over sublime-syntax. Rather, someone wanted to implement tree-sitter, and no one offered to implement sublime-syntax. I took a quick look, and I didn’t see any issues that even mentioned sublime-syntax, so I took the liberty of opening one.

4 Likes

#20

You underestimate how old the Atom project is. There were bits and pieces of it at GitHub long before ST3 seemed to actually go anywhere (remember about 5 years ago you’d have to be quite a fan to expect ST to live on) and sublime-syntax was introduced. Also long before it was public.

Why on earth VS Code went with tmLanguage though, I don’t understand. It’s a bit naive though, to expect them, or anyone else, to converge towards a new system just because it exists.

3 Likes