Sublime Forum

Tree-sitter support

#8

A good explanation is Jim Roskind on C ambiguity.

Tree-sitter is basically a GLR parser, and GLR parsers are context-free. Therefore, it’s impossible for tree-sitter to handle this kind of ambiguity. From the link:

> Other threads of response included the use of a GLR parser, which tries all alternatives. Alas, as will be shown, the ambiguity in C is too great, and such an approach will not provide a unique syntactically correct parse when it does exist.
>
> Since one of the respondees included suggesting that a GLR parser could just try all interpretations, and that only one would be valid, the following is a rather interesting example:
>
>     typedef int T;
>     main() {
>     int T=100, a=(T)+1;
>     assert(101 == T);
>     }
>
> Notice that the text "(T)+1" could incorrectly be interpreted, if "T" were still a typedef-name, as a cast to type T of the expression "+1".  Note that the GLR approach would yield two distinct and valid parses, and hence there would be no single winner (without additional heuristics).

Real C parsers handle the ambiguity in the lexical analysis stage. They build up a symbol table as they go along, so they can tell when they encounter a name whether it belongs to a type or a variable.
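
A minimal sketch of that idea, in Python for brevity. The function name `classify_identifiers` and the `typedef_names` set are hypothetical stand-ins for a real compiler's lexer and symbol table; the point is only that the same text `(T)+1` produces different token streams depending on what the table says about `T` at that moment.

```python
import re

# One token per match: a number, a name, or a single punctuation character.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(.))")

def classify_identifiers(source, typedef_names):
    """Tag each identifier as TYPE_NAME or IDENTIFIER using a symbol table.

    `typedef_names` is an assumption of this sketch: it stands in for the
    symbol table that a real C front end updates as it reads declarations.
    """
    tokens = []
    for number, name, punct in TOKEN_RE.findall(source):
        if number:
            tokens.append(("NUMBER", number))
        elif name:
            kind = "TYPE_NAME" if name in typedef_names else "IDENTIFIER"
            tokens.append((kind, name))
        elif punct:
            tokens.append(("PUNCT", punct))
    return tokens

# While the typedef is visible, "(T)" looks like the start of a cast.
as_cast = classify_identifiers("(T)+1", {"T"})
# After "int T=100" shadows the typedef, "(T)" is a parenthesized variable.
as_variable = classify_identifiers("(T)+1", set())
```

Because the parser receives `TYPE_NAME` in one case and `IDENTIFIER` in the other, the grammar itself never sees the ambiguity.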

[Tree-sitter's documentation](http://tree-sitter.github.io/tree-sitter/creating-parsers#lexical-analysis) indicates that the algorithm does separately perform lexical analysis, but as far as I can tell there is no way to write a custom lexer that maintains an internal symbol table and uses that table to distinguish variables from types. In any event, a custom lexer of that sort probably couldn't be compiled in the same way that the grammar is, and even if it could, it would mean an extra layer of abstraction between the code and the parser, which wouldn't be great for performance. Plus, I'm skeptical that arbitrary user-defined lexers would be compatible with tree-sitter's reparsing guarantees, but I could be wrong about that.

I suspect that Sublime's engine actually could be augmented to handle C types. Sublime's syntax definitions are essentially descriptions of a pushdown automaton. Executing a PDA is very simple and “dumb”, so you're not likely to run into thorny algorithmic dilemmas when extending its behavior. In the [nondeterministic parsing proposal](https://github.com/SublimeTextIssues/Core/issues/2241), I identify only one place where the proposed behavior might interfere with existing behavior, and it's a performance optimization that could be remedied. The tree-sitter algorithm is much more sophisticated, which makes it potentially brittle. Adding context-sensitive features would likely violate core invariants of the algorithm, and it might not be easy or even possible to restore those invariants.
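
To illustrate how little machinery a pushdown-style highlighter needs, here is a rough Python sketch. It is not Sublime's engine or the sublime-syntax format; the `CONTEXTS` table, the scope names, and the `highlight` function are all invented for this example. Each context is a list of regex rules, and a rule may push or pop a context on the stack.

```python
import re

# Hypothetical two-context definition: plain code and double-quoted strings.
CONTEXTS = {
    "main": [
        (re.compile(r'"'), "string.begin", ("push", "string")),
        (re.compile(r"\d+"), "number", None),
    ],
    "string": [
        (re.compile(r"\\."), "escape", None),
        (re.compile(r'"'), "string.end", ("pop",)),
    ],
}

def highlight(text, contexts=CONTEXTS, start="main"):
    """Return (start, end, scope) spans using a stack of contexts."""
    stack = [start]
    pos = 0
    spans = []
    while pos < len(text):
        best = None
        # Only the rules of the context on top of the stack are active.
        for pattern, scope, action in contexts[stack[-1]]:
            m = pattern.search(text, pos)
            if m and m.end() > m.start():
                if best is None or m.start() < best[0].start():
                    best = (m, scope, action)
        if best is None:
            break  # nothing else matches in the current context
        m, scope, action = best
        spans.append((m.start(), m.end(), scope))
        if action and action[0] == "push":
            stack.append(action[1])
        elif action and action[0] == "pop" and len(stack) > 1:
            stack.pop()
        pos = m.end()
    return spans

spans = highlight('x = 12 "a\\"7"')
```

In this sketch the `7` inside the string is not scoped as a number, because the `number` rule belongs to `main` and the stack top is `string` at that point. Extending behavior means adding rules or actions, not changing the loop.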

For instance, I can't find what regex flavor tree-sitter uses, but I suspect that it doesn't support non-regular features like backreferences. Sublime does — it has its own highly-optimized regex engine that only handles truly regular expressions, but it will fall back to Oniguruma to handle non-regular features. There is a performance penalty, but when used sparingly backreferences add real expressive power to the engine that I'm not sure tree-sitter can match.
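
As a small illustration of what a backreference buys you, here is a sketch using Python's `re` module (not Sublime's engine or Oniguruma). The pattern `HEREDOC` and the helper `heredoc_body` are made up for this example. The closing delimiter must repeat whatever word followed `<<`, which a purely regular pattern cannot express for arbitrary delimiters.

```python
import re

# Group 1 captures the delimiter; \1 requires the same text on the closing line.
HEREDOC = re.compile(r"<<(\w+)\n(.*?)\n\1$", re.DOTALL | re.MULTILINE)

def heredoc_body(text):
    """Return the heredoc body, or None if no matching terminator exists."""
    m = HEREDOC.search(text)
    return m.group(2) if m else None

# "END" does not close the heredoc, because the opening delimiter was "EOF".
body = heredoc_body("cat <<EOF\nhello\nEND\nEOF\n")
unterminated = heredoc_body("<<A\nx\nB\n")
```
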

4 Likes

#9

> If that’s the case, I’d still expect it to be slower than Sublime’s engine. JavaScript compiled to C is still going to have some overhead compared to ordinary C.

Your understanding is wrong.

JavaScript is only used as a DSL for writing grammars. (Alternatively, grammars can be written in plain JSON.) From the grammar a C parser is generated. No JavaScript is ever compiled to C.

The significance of tree-sitter is that it will hopefully become a new cross-editor standard for syntax highlighting. The current situation, with Atom, VSCode, and Sublime each preferring a different syntax highlighting system, is not sustainable.

PS: A Rust rewrite of tree-sitter’s JavaScript tooling is underway.

1 Like

#10

> The significance of tree-sitter is that it will hopefully become a new cross-editor standard for syntax highlighting.

Perhaps. I see three chief obstacles to this happening:

  1. The tree-sitter algorithm is much more complicated than a simple PDA-based highlighter despite being fundamentally no more powerful. (As I’ve mentioned, nondeterminism could be added to the sublime-syntax system without significantly changing it.)
  2. The sublime-syntax system supports some non-context-free features that the GLR algorithm cannot ever replicate. The tree-sitter implementation allows developers to accommodate these features by writing custom C extensions, which is inconvenient to say the least.
  3. The current cross-editor standard is tmLanguage. Authoring a sublime-syntax is very similar to authoring a tmLanguage, especially considering that many tmLanguage authors use the YAML representation anyway. Given a tmLanguage, an automated tool can convert it to a sublime-syntax, and the author can add whatever degree of syntactic sophistication is appropriate. The migration path for tree-sitter is unclear: an author must write a brand-new syntax definition using a different paradigm, and the grammar must be complete (if not comprehensive).

If tree-sitter really takes off, and there are high-quality tree-sitter definitions that would be useful to Sublime users, then I expect that someone would probably develop an automated tool to convert (deterministic) tree-sitter grammars to sublime-syntax definitions. I could be misremembering, but I believe that it’s generally easier to convert a grammar to an automaton than vice versa.

> The current situation, with Atom, VSCode, and Sublime each preferring a different syntax highlighting system, is not sustainable.

I’m not sure what you mean by this.

3 Likes

#11

As an addendum to the above, I want to clarify that I think that tree-sitter is, in itself, a perfectly good system. I like algorithmic complexity. I also generally prefer declarative systems (e.g. grammars) to procedural ones (e.g. automata). My technical concern about tree-sitter is that it is inflexible. I don’t see how the GLR algorithm could be extended to handle heredocs, whitespace, or other non-context-free features. If there is existing research on this, I would be interested to read it.
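
To make the whitespace case concrete, this is the kind of state a lexer usually carries for an indentation-sensitive language. It is only a sketch under my own assumptions: `indentation_tokens` is a hypothetical helper, it counts leading spaces only, and it does not report inconsistent dedents. The stack of open indentation widths is exactly the part that sits outside a context-free grammar.

```python
def indentation_tokens(lines):
    """Emit INDENT/DEDENT/LINE tokens from leading-space widths."""
    stack = [0]  # widths of the currently open indentation levels
    tokens = []
    for line in lines:
        if not line.strip():
            continue  # blank lines do not affect indentation
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            tokens.append("INDENT")
        while width < stack[-1]:
            stack.pop()
            tokens.append("DEDENT")
        tokens.append("LINE")
    # Close any blocks still open at end of input.
    while len(stack) > 1:
        stack.pop()
        tokens.append("DEDENT")
    return tokens

tokens = indentation_tokens(["a:", "  b:", "    c", "d"])
```

Once the lexer has produced explicit `INDENT` and `DEDENT` tokens, the remaining structure is ordinary bracket matching, which a context-free parser handles fine.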

My other concern is that while tree-sitter is a fine system for producing high-quality syntax definitions, I’m not convinced that it’s a good system for producing okay-quality syntax definitions. I review packages submitted to the Package Control channel; every week, there are new syntaxes for languages or files I’ve never heard of. Most of these are simple sublime-syntax files written by non-experts solving a single problem. The sublime-syntax system seems to be a good match for this. Time will tell whether tree-sitter can fill the same role.

4 Likes

#12

Not to detract or tangent from the conversation, but this reminds me of a great XKCD comic:

2 Likes

#13

That’s a fun comic, but it doesn’t apply here. There was a de-facto standard (TextMate) which worked fine for a while. Now that it’s dead, a new standard is needed.

0 Likes

#14

> > The current situation, with Atom, VSCode, and Sublime each preferring a different syntax highlighting system, is not sustainable.
>
> I’m not sure what you mean by this.

Atom is moving to tree-sitter and no longer maintains its TextMate grammars. VSCode is stuck on TextMate grammars, with all the horrors that this entails (>500 issues reported for syntax highlighting of TypeScript alone (!!!), ever more issues filed for TextMate grammars used by VSCode and no longer maintained by Atom). Sublime is betting on its own (from what I understand proprietary) highlighting system. As a language author, I can’t and don’t want to maintain three different grammars.

1 Like

#15

I never understood why Atom and VSCode decided to use TextMate grammars when .sublime-syntax files were already mature before those editors were conceived.

1 Like

#16

If the goal is standardization, Atom would have been better off using sublime-syntax rather than making their own system that no one else supported. (I get the impression that the authors of tree-sitter didn’t really understand sublime-syntax and believed that it was basically a thin layer of syntactic sugar over tmLanguage.)

> Sublime is betting on its own (from what I understand proprietary) highlighting system.

“Proprietary” in the sense that other editors would have to implement it themselves, but that’s not exactly difficult. A sublime-syntax definition is basically a spec for a DPDA that will parse the code. It’s an incredibly simple system, particularly next to the complexity of tree-sitter. There’s an open-source implementation written in Rust. Heck, I wrote a Node.js implementation once because I was bored, and it’s faithful enough that I discovered several new bugs in the process. Despite this simplicity, it is comparable in power to tree-sitter, extremely fast, and amenable to various avenues of extension.

Myself, I always expected Atom and VSCode to implement sublime-syntax. I was quite surprised when Atom went with its own, incompatible system instead. If VSCode ever implements a more modern system, I would be equally surprised if they implemented tree-sitter over sublime-syntax. (I wouldn’t be surprised if they implemented another new, incompatible format; it’s Microsoft, after all.)

I feel confident in predicting Sublime isn’t going to implement tree-sitter. It’s not inherently better than sublime-syntax, and the one thing that tree-sitter does that sublime-syntax doesn’t — nondeterministic parsing — could be implemented in sublime-syntax anyway. It’s a much more complicated system with more moving parts. It can’t handle some common context-sensitive constructs that sublime-syntax can, and there’s no obvious way to extend tree-sitter itself. The custom C lexers used by many Atom syntaxes probably wouldn’t port, making it hard to reuse existing definitions.

I think it’s far more likely that either someone will write a converter from tree-sitter to sublime-syntax definitions, or Atom will implement sublime-syntax alongside tree-sitter.

2 Likes

#17

Here’s my prediction: VSCode and github.com will follow Atom and support tree-sitter. Nobody will bother to maintain TextMate or sublime-syntax files. Let’s check back in a year!

1 Like

#18

If the problem is ignorance of sublime-syntax, perhaps we should reach out to Atom and VS Code authors, particularly the latter, and enlighten them. I could see them being wary of reverse-engineering this proprietary system; an official statement from the HQ might be helpful there.

2 Likes

#19

I’ve had a conversation with the author of tree-sitter.

I don’t get the impression that Atom decided to improve their highlighting, surveyed the options, and made a conscious decision to go with tree-sitter over sublime-syntax. Rather, someone wanted to implement tree-sitter, and no one offered to implement sublime-syntax. I took a quick look, and I didn’t see any issues that even mentioned sublime-syntax, so I took the liberty of opening one.

4 Likes

#20

You underestimate how old the Atom project is. There were bits and pieces of it at GitHub long before ST3 seemed to actually go anywhere (remember about 5 years ago you’d have to be quite a fan to expect ST to live on) and sublime-syntax was introduced. Also long before it was public.

Why on earth VS Code went with tmLanguage though, I don’t understand. It’s a bit naive though, to expect them, or anyone else, to converge towards a new system just because it exists.

3 Likes

#21

Update: one of the Atom devs closed the issue because:

> We don’t feel that adding a third grammar system to Atom is going to be beneficial to the ecosystem as a whole.

No idea who “we” is, or how they came up with that decision, or when. Instead, the Atom dev linked me to the other discussion I linked here. I’m not sure that he noticed that I was part of that discussion.

0 Likes

#22

Well GitHub has a core team assigned to Atom development and they don’t necessarily communicate everything via issues or pull requests. It’s an open source project, but very strongly driven by GitHub, not by “the community”. They probably had a meeting :information_desk_person:

1 Like

#23

If true, it makes their argument of “why Atom should be bossed by a proprietary company” quite a lot weaker though :stuck_out_tongue:

0 Likes

#24

It’s never in doubt that Atom only exists because GitHub thinks it’s useful to spend a couple million a year on it. Same as VS Code and MS. I guess “proprietary company” here means any company other than the one who pays the bills. It’s no different from how any other large open source project is run though, although the “community” has a tendency to be “surprised” (if not outraged) whenever anyone breaks the spell.

1 Like

#25

For anyone that finds this thread, there’s now a TreeSitter package for Sublime Text: https://github.com/sublime-treesitter/TreeSitter.

It provides Sublime Text with a performant and flexible interface to Tree-sitter. It works out of the box with around 40 languages, and can be configured to work with any language that has a Tree-sitter grammar.

Tree-sitter builds a parse tree for text in any buffer, fast enough to update the tree after every keystroke. Sublime already has a great syntax highlighting system, but Tree-sitter parse trees can be used for a bunch of other things as well.

The package ships with commands for “tree-based” selection and navigation. For example, you can select ancestor, descendant, or “cousin” nodes based on the current selection. This makes it easy to select the whole class or function when your cursor is “within” that class or function. You can also go to symbols returned by configurable tree queries, with symbol breadcrumbs for context.
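
For readers unfamiliar with tree-based selection, the idea can be sketched in a few lines of Python. This is not the TreeSitter package's API or the Tree-sitter bindings: the `Node` class and the helpers `smallest_node_at` and `enclosing` are hypothetical and only model a parse tree with byte ranges and parent links.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)
class Node:
    """Toy parse-tree node covering the half-open range [start, end)."""
    kind: str
    start: int
    end: int
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child

def smallest_node_at(root, offset):
    """Descend to the deepest node whose range contains the cursor offset."""
    node = root
    while True:
        for child in node.children:
            if child.start <= offset < child.end:
                node = child
                break
        else:
            return node  # no child contains the offset

def enclosing(node, kind):
    """Walk up the parent links to the nearest ancestor (or self) of `kind`."""
    while node is not None and node.kind != kind:
        node = node.parent
    return node

root = Node("module", 0, 100)
cls = root.add(Node("class", 10, 90))
fn = cls.add(Node("function", 20, 60))
ident = fn.add(Node("identifier", 24, 27))

at_cursor = smallest_node_at(root, 25)
```

Selecting "the whole function" is then `enclosing(at_cursor, "function")`, and repeating the walk with `"class"` expands the selection one level further.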

And it exports APIs that let package developers build Tree-sitter based packages and custom commands.

2 Likes

#26

The commands are useful, however, the real benefit (to me, at least) is to have better scope detection for plugins, especially themes.

Will having the TreeSitter plugin change the scope of different nodes when installed so that themes will automatically pick them up? Or, is its representation of nodes entirely distinct from the built-in scopes?

0 Likes

#27

Plugins can’t modify syntax highlighting. They can add regions, which may to some extent be abused to add something like semantic highlighting, but that may tank performance on large source files.

0 Likes