Sublime Forum

**ThomSmith** · November 23, 2018, 6:07pm

A good explanation is Jim Roskind on C ambiguity.

Tree-sitter is basically a GLR parser, and GLR parsers are context-free. Therefore, it’s impossible for tree-sitter to handle this kind of ambiguity. From the link:

> Other threads of response included the use of a GLR parser, which tries all alternatives. Alas, as will be shown, the ambiguity in C is too great, and such an approach will not provide a unique syntactically correct parse when it does exist.
>
> …
>
> Since one of the respondees included suggesting that a GLR parser could just try all interpretations, and that only one would be valid, the following is a rather interesting example:
>
>     typedef int T;
>     main() {
>         int T=100, a=(T)+1;
>         assert(101 == T);
>     }
>
> Notice that the text "(T)+1" could incorrectly be interpreted, if "T" were still a typedef-name, as a cast to type T of the expression "+1". Note that the GLR approach would yield two distinct and valid parses, and hence there would be no single winner (without additional heuristics).

Real C parsers handle the ambiguity in the lexical analysis stage. They build up a symbol table as they go along, so they can tell when they encounter a name whether it belongs to a type or a variable.
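To make the symbol-table idea concrete, here is a minimal sketch in C of the kind of lookup a lexer could do before handing a name to the parser. All names here (`symbol_table`, `declare`, `classify`, and so on) are hypothetical and do not come from any real compiler; the fixed-size table and the flat scope handling are simplifying assumptions.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical token kinds: what the lexer reports for a bare name. */
enum token_kind { TOK_IDENTIFIER, TOK_TYPE_NAME };

#define MAX_SYMBOLS 64

struct symbol {
    const char *name;
    int is_typedef;  /* 1 if introduced by typedef, 0 for an ordinary name */
    int depth;       /* block nesting level where the name was declared */
};

struct symbol_table {
    struct symbol entries[MAX_SYMBOLS];
    int count;
    int depth;
};

void table_init(struct symbol_table *t) {
    t->count = 0;
    t->depth = 0;
}

/* Called when the parser sees '{'. */
void scope_enter(struct symbol_table *t) {
    t->depth++;
}

/* Called when the parser sees '}': forget every name declared in the block. */
void scope_leave(struct symbol_table *t) {
    assert(t->depth > 0);
    while (t->count > 0 && t->entries[t->count - 1].depth == t->depth)
        t->count--;
    t->depth--;
}

/* Record a declaration in the current scope. */
void declare(struct symbol_table *t, const char *name, int is_typedef) {
    assert(t->count < MAX_SYMBOLS);
    t->entries[t->count].name = name;
    t->entries[t->count].is_typedef = is_typedef;
    t->entries[t->count].depth = t->depth;
    t->count++;
}

/* Search newest-first so an inner declaration shadows an outer typedef. */
enum token_kind classify(const struct symbol_table *t, const char *name) {
    int i;
    for (i = t->count - 1; i >= 0; i--) {
        if (strcmp(t->entries[i].name, name) == 0)
            return t->entries[i].is_typedef ? TOK_TYPE_NAME : TOK_IDENTIFIER;
    }
    return TOK_IDENTIFIER;
}
```

Applied to Roskind's example: after `declare(&t, "T", 1)` at file scope, `classify` reports a type name. Once the body of `main` is entered and `declare(&t, "T", 0)` records the local variable, `classify` reports an ordinary identifier, so `(T)+1` can only be an expression. Leaving the block restores the typedef.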

[Tree-sitter's documentation](http://tree-sitter.github.io/tree-sitter/creating-parsers#lexical-analysis) indicates that the algorithm does separately perform lexical analysis, but as far as I can tell there is no way to write a custom lexer that maintains an internal symbol table and uses that table to distinguish variables from types. In any event, a custom lexer of that sort probably couldn't be compiled in the same way that the grammar is, and even if it could it would mean an extra layer of abstraction between the code and the parser, which wouldn't be great for performance. Plus, I'm skeptical that arbitrary user-defined lexers would be compatible with tree-sitter's reparsing guarantees, but I could be wrong about that.

I suspect that Sublime's engine actually could be augmented to handle C types. Sublime's syntax definitions are essentially descriptions of a pushdown automaton. Executing a PDA is very simple and “dumb”, so you're not likely to run into thorny algorithmic dilemmas when extending its behavior. In the [nondeterministic parsing proposal](https://github.com/SublimeTextIssues/Core/issues/2241), I identify only one place where the proposed behavior might interfere with existing behavior, and it's a performance optimization that could be remedied. The tree-sitter algorithm is much more sophisticated, which makes it potentially brittle. Adding context-sensitive features would likely violate core invariants of the algorithm, and it might not be easy or even possible to restore those invariants.
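To illustrate what "executing a PDA" amounts to, here is a minimal sketch in C of a stack-driven scanner. It assumes a toy syntax with three contexts (top level, parentheses, double-quoted strings) and is not Sublime's engine; the function name `highlight` and the one-letter scope codes are made up for the example.

```c
#include <assert.h>
#include <string.h>  /* size_t */

/* Hypothetical contexts, loosely analogous to contexts in a syntax definition. */
enum context { CTX_MAIN, CTX_PAREN, CTX_STRING };

#define MAX_DEPTH 32

/*
 * Writes one scope letter per input character into `scopes`
 * ('.' = top level, 'p' = inside parentheses, 's' = inside a string)
 * and returns the number of contexts still open at the end of input.
 * `scopes` must have room for strlen(text) + 1 bytes.
 */
int highlight(const char *text, char *scopes) {
    enum context stack[MAX_DEPTH];
    int top = 0;
    size_t i;

    assert(text != NULL && scopes != NULL);
    stack[0] = CTX_MAIN;

    for (i = 0; text[i] != '\0'; i++) {
        char ch = text[i];
        enum context ctx = stack[top];

        if (ctx == CTX_STRING) {
            /* Inside a string only the closing quote matters. */
            scopes[i] = 's';
            if (ch == '"')
                top--;                     /* pop the string context */
        } else if (ch == '"' && top + 1 < MAX_DEPTH) {
            stack[++top] = CTX_STRING;     /* push */
            scopes[i] = 's';
        } else if (ch == '(' && top + 1 < MAX_DEPTH) {
            stack[++top] = CTX_PAREN;      /* push */
            scopes[i] = 'p';
        } else if (ch == ')' && ctx == CTX_PAREN) {
            scopes[i] = 'p';
            top--;                         /* pop */
        } else {
            scopes[i] = (ctx == CTX_PAREN) ? 'p' : '.';
        }
    }
    scopes[i] = '\0';
    return top;
}
```

Each step looks only at the current character and the top of the stack, which is why the loop stays this small. A `)` inside a string does not close the enclosing parenthesis, because the string context is on top of the stack when it is read.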

For instance, I can't find what regex flavor tree-sitter uses, but I suspect that it doesn't support non-regular features like backreferences. Sublime does — it has its own highly-optimized regex engine that only handles truly regular expressions, but it will fall back to Oniguruma to handle non-regular features. There is a performance penalty, but when used sparingly backreferences add real expressive power to the engine that I'm not sure tree-sitter can match.
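As a rough illustration of what a backreference buys you, here is a sketch in C using POSIX `regcomp`/`regexec` as a stand-in (it is neither Sublime's engine nor Oniguruma). The pattern and the function name `is_closed_heredoc` are assumptions for the example: it accepts a heredoc-like text only when the closing line repeats the opening tag, which no fixed list of literal alternatives can express for arbitrary tags.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Returns 1 if `text` looks like "<<TAG\n...\nTAG" with the same TAG at
 * both ends, 0 if it does not, and -1 if the pattern fails to compile.
 * The pattern is a POSIX basic regular expression; \1 is the backreference
 * to the tag captured by \( ... \).
 */
int is_closed_heredoc(const char *text) {
    regex_t re;
    int rc;

    assert(text != NULL);
    if (regcomp(&re, "^<<\\([A-Z][A-Z]*\\)\n.*\n\\1$", 0) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this pattern, `"<<EOT\nhello\nEOT"` is accepted and `"<<EOT\nhello\nEND"` is rejected, because the final line must equal whatever the first group captured.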

**fred.curts** · January 19, 2019, 7:12am

> If that’s the case, I’d still expect it to be slower than Sublime’s engine. JavaScript compiled to C is still going to have some overhead compared to ordinary C.

Your understanding is wrong.

JavaScript is only used as a DSL for writing grammars. (Alternatively, grammars can be written in plain JSON.) From the grammar a C parser is generated. No JavaScript is ever compiled to C.

The significance of tree-sitter is that it will hopefully become a new cross-editor standard for syntax highlighting. The current situation, with Atom, VSCode, and Sublime each preferring a different syntax highlighting system, is not sustainable.

PS: A Rust rewrite of tree-sitter’s JavaScript tooling is underway.

**ThomSmith** · January 19, 2019, 8:12am

> The significance of tree-sitter is that it will hopefully become a new cross-editor standard for syntax highlighting.

Perhaps. I see three chief obstacles to this happening:

1. The tree-sitter algorithm is much more complicated than a simple PDA-based highlighter despite being fundamentally no more powerful. (As I’ve mentioned, nondeterminism could be added to the sublime-syntax system without significantly changing it.)
2. The sublime-syntax system supports some non-context-free features that the GLR algorithm cannot ever replicate. The tree-sitter implementation allows developers to accommodate these features by writing custom C extensions, which is inconvenient to say the least.
3. The current cross-editor standard is tmLanguage. Authoring a sublime-syntax is very similar to authoring a tmLanguage, especially considering that many tmLanguage authors use the YAML representation anyway. Given a tmLanguage, an automated tool can convert it to a sublime-syntax, and the author can add whatever degree of syntactic sophistication is appropriate. The migration path for tree-sitter is unclear: an author must write a brand-new syntax definition using a different paradigm, and the grammar must be complete (if not comprehensive).

If tree-sitter really takes off, and there are high-quality tree-sitter definitions that would be useful to Sublime users, then I expect that someone would probably develop an automated tool to convert (deterministic) tree-sitter grammars to sublime-syntax definitions. I could be misremembering, but I believe that it’s generally easier to convert a grammar to an automaton than vice versa.

> The current situation, with Atom, VSCode, and Sublime each preferring a different syntax highlighting system, is not sustainable.

I’m not sure what you mean by this.

**ThomSmith** · January 19, 2019, 6:04pm

As an addendum to the above, I want to clarify that I think that tree-sitter is, in itself, a perfectly good system. I like algorithmic complexity. I also generally prefer declarative systems (e.g. grammars) to procedural ones (e.g. automata). My technical concern about tree-sitter is that it is inflexible. I don’t see how the GLR algorithm could be extended to handle heredocs, whitespace, or other non-context-free features. If there is existing research on this, I would be interested to read it.

My other concern is that while tree-sitter is a fine system for producing high-quality syntax definitions, I’m not convinced that it’s a good system for producing okay-quality syntax definitions. I review packages submitted to the Package Control channel; every week, there are new syntaxes for languages or files I’ve never heard of. Most of these are simple sublime-syntax files written by non-experts solving a single problem. The sublime-syntax system seems to be a good match for this. Time will tell whether tree-sitter can fill the same role.

**srbs** · January 21, 2019, 1:50am

Not to detract or tangent from the conversation, but this reminds me of a great XKCD comic:

**fred.curts** · January 21, 2019, 2:06am

That’s a fun comic, but it doesn’t apply here. There was a de-facto standard (TextMate) which worked fine for a while. Now that it’s dead, a new standard is needed.

**fred.curts** · January 21, 2019, 2:14am

> > The current situation, with Atom, VSCode, and Sublime each preferring a different syntax highlighting system, is not sustainable.
>
> I’m not sure what you mean by this.

Atom is moving to tree-sitter and no longer maintains its TextMate grammars. VSCode is stuck on TextMate grammars, with all the horrors that this entails (>500 issues reported for syntax highlighting of TypeScript alone (!!!), ever more issues filed for TextMate grammars used by VSCode and no longer maintained by Atom). Sublime is betting on its own (from what I understand proprietary) highlighting system. As a language author, I can’t and don’t want to maintain three different grammars.

**kingkeith** · January 21, 2019, 3:21am

I never understood why Atom and VSCode decided to use TextMate grammars when .sublime-syntax files were already mature before those editors were conceived

**ThomSmith** · January 21, 2019, 3:42am

If the goal is standardization, Atom would have been better off using sublime-syntax rather than making their own system that no one else supported. (I get the impression that the authors of tree-sitter didn’t really understand sublime-syntax and believed that it was basically a thin layer of syntactic sugar over tmLanguage.)

> Sublime is betting on its own (from what I understand proprietary) highlighting system.

“Proprietary” in the sense that other editors would have to implement it themselves, but that’s not exactly difficult. A sublime-syntax definition is basically a spec for a DPDA that will parse the code. It’s an incredibly simple system, particularly next to the complexity of tree-sitter. There’s an open-source implementation written in Rust. Heck, I wrote a Node.js implementation once because I was bored, and it’s faithful enough that I discovered several new bugs in the process. Despite this simplicity, it is comparable in power to tree-sitter, extremely fast, and amenable to various avenues of extension.

Myself, I always expected Atom and VSCode to implement sublime-syntax. I was quite surprised when Atom went with its own, incompatible system instead. If VSCode ever implements a more modern system, I would be equally surprised if they implemented tree-sitter over sublime-syntax. (I wouldn’t be surprised if they implemented another new, incompatible format; it’s Microsoft, after all.)

I feel confident in predicting Sublime isn’t going to implement tree-sitter. It’s not inherently better than sublime-syntax, and the one thing that tree-sitter does that sublime-syntax doesn’t — nondeterministic parsing — could be implemented in sublime-syntax anyway. It’s a much more complicated system with more moving parts. It can’t handle some common context-sensitive constructs that sublime-syntax can, and there’s no obvious way to extend tree-sitter itself. The custom C lexers used by many Atom syntaxes probably wouldn’t port, making it hard to reuse existing definitions.

I think it’s far more likely that either someone will write a converter from tree-sitter to sublime-syntax definitions, or Atom will implement sublime-syntax alongside tree-sitter.

**fred.curts** · January 21, 2019, 4:00am

Here’s my prediction: VSCode and github.com will follow Atom and support tree-sitter. Nobody will bother to maintain TextMate or sublime-syntax files. Let’s check back in a year!

**Mitranim** · January 21, 2019, 10:53am

If the problem is ignorance of sublime-syntax, perhaps we should reach out to Atom and VS Code authors, particularly the latter, and enlighten them. I could see them being wary of reverse-engineering this proprietary system; an official statement from the HQ might be helpful there.

**ThomSmith** · January 21, 2019, 3:13pm

I’ve had a conversation with the author of tree-sitter.

I don’t get the impression that Atom decided to improve their highlighting, surveyed the options, and made a conscious decision to go with tree-sitter over sublime-syntax. Rather, someone wanted to implement tree-sitter, and no one offered to implement sublime-syntax. I took a quick look, and I didn’t see any issues that even mentioned sublime-syntax, so I took the liberty of opening one.

**braver** · January 22, 2019, 6:47am

You underestimate how old the Atom project is. There were bits and pieces of it at GitHub long before ST3 seemed to actually go anywhere (remember, about 5 years ago you’d have to be quite a fan to expect ST to live on) and long before sublime-syntax was introduced. That was also long before Atom was public.

Why on earth VS Code went with tmLanguage though, I don’t understand. It’s a bit naive though, to expect them, or anyone else, to converge towards a new system just because it exists.

**ThomSmith** · January 22, 2019, 2:56pm

Update: one of the Atom devs closed the issue because:

> We don’t feel that adding a third grammar system to Atom is going to be beneficial to the ecosystem as a whole.

No idea who “we” is, or how they came up with that decision, or when. Instead, the Atom dev linked me to the other discussion I linked here. I’m not sure that he noticed that I was part of that discussion.

**braver** · January 23, 2019, 8:30am

Well, GitHub has a core team assigned to Atom development and they don’t necessarily communicate everything via issues or pull requests. It’s an open source project, but very strongly driven by GitHub, not by “the community”. They probably had a meeting.

**eproxus** · January 23, 2019, 11:31am

If true, it makes their argument of “why Atom should be bossed by a proprietary company” quite a lot weaker though

**braver** · January 23, 2019, 11:35am

It’s never in doubt that Atom only exists because GitHub thinks it’s useful to spend a couple million a year on it. Same as VS Code and MS. I guess “proprietary company” here means any company other than the one who pays the bills. It’s no different from how any other large open source project is run though, although the “community” has a tendency to be “surprised” (if not outraged) whenever anyone breaks the spell.

**kylebebak** · February 8, 2024, 5:51pm

For anyone that finds this thread, there’s now a TreeSitter package for Sublime Text: https://github.com/sublime-treesitter/TreeSitter.

It provides Sublime Text with a performant and flexible interface to Tree-sitter. It works out of the box with around 40 languages, and can be configured to work with any language that has a Tree-sitter grammar.

Tree-sitter builds a parse tree for text in any buffer, fast enough to update the tree after every keystroke. Sublime already has a great syntax highlighting system, but Tree-sitter parse trees can be used for a bunch of other things as well.

The package ships with commands for “tree-based” selection and navigation. For example, you can select ancestor, descendant, or “cousin” nodes based on the current selection. This e.g. makes it easy to select the whole class or function if your cursor is currently “within” that class or function. You can also go to symbols returned by configurable tree queries, with symbol breadcrumbs for context.
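As a rough picture of what "select the enclosing class or function" involves, here is a sketch in C of finding the innermost node of a given kind that contains a cursor position. The `struct node` layout and the `enclosing` function are hypothetical and only stand in for the idea; the real Tree-sitter library and the TreeSitter package expose their own, different APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parse-tree node; not the Tree-sitter node type. */
struct node {
    const char *kind;             /* e.g. "function_definition" */
    size_t start, end;            /* half-open byte range [start, end) */
    const struct node *children;  /* array of child nodes, or NULL */
    size_t child_count;
};

/*
 * Returns the innermost node of the requested kind whose range contains
 * `pos`, or NULL if there is none. Children are searched before the node
 * itself so that the deepest match wins.
 */
const struct node *enclosing(const struct node *n, size_t pos, const char *kind) {
    size_t i;

    assert(n != NULL && kind != NULL);
    if (pos < n->start || pos >= n->end)
        return NULL;

    for (i = 0; i < n->child_count; i++) {
        const struct node *hit = enclosing(&n->children[i], pos, kind);
        if (hit != NULL)
            return hit;
    }
    return strcmp(n->kind, kind) == 0 ? n : NULL;
}
```

A selection command could then set the selection to `[hit->start, hit->end)` for the returned node.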

And it exports APIs that let package developers build Tree-sitter based packages and custom commands.

**distefam** · February 9, 2024, 3:19pm

The commands are useful; however, the real benefit (to me, at least) is to have better scope detection for plugins, especially themes.

Will having the TreeSitter plugin change the scope of different nodes when installed so that themes will automatically pick them up? Or, is its representation of nodes entirely distinct from the built-in scopes?

**deathaxe** · February 9, 2024, 4:12pm

Plugins can’t modify syntax highlighting. They can add regions, which may to some extent be abused to add something like semantic highlighting, but that may tank performance on large source files.
