Sublime Forum

Syntax Fun

#22

FYI I’ve been using a makeshift test suite for indentation tests and language/syntax tests.

See github.com/gerardroche/sublime- … IBUTING.md

The latest work is done in the develop branch.
I’ll be pushing some helper dev tools for creating language test files shortly, although any further work may be moot at this point.

Examples:

--TEST--
Basic function support
--FILE--
<?php

array_map();
str_shuffle();
\str_shuffle();

?>
--EXPECT--
equal:2:0:source.php source.php.embedded.block.html support.function.array.php
match:2:0:source.php support.function
match:2:0:support.function
match:3:0:source.php support.function.string.php
match:4:0:source.php punctuation.separator.inheritance.php
match:4:1:source.php support.function.string.php
--TEST--
Basic namespace
--FILE--
<?php

namespace Name;

?>
--EXPECT--
source.php source.php.embedded.block.html punctuation.section.embedded.begin.php
source.php source.php.embedded.block.html punctuation.section.embedded.begin.php
source.php source.php.embedded.block.html punctuation.section.embedded.begin.php
source.php source.php.embedded.block.html punctuation.section.embedded.begin.php
source.php source.php.embedded.block.html punctuation.section.embedded.begin.php
source.php source.php.embedded.block.html
source.php source.php.embedded.block.html
source.php source.php.embedded.block.html meta.namespace.php keyword.other.namespace.php
source.php source.php.embedded.block.html meta.namespace.php keyword.other.namespace.php
source.php source.php.embedded.block.html meta.namespace.php keyword.other.namespace.php
source.php source.php.embedded.block.html meta.namespace.php keyword.other.namespace.php
source.php source.php.embedded.block.html meta.namespace.php keyword.other.namespace.php
source.php source.php.embedded.block.html meta.namespace.php keyword.other.namespace.php
source.php source.php.embedded.block.html meta.namespace.php keyword.other.namespace.php
source.php source.php.embedded.block.html meta.namespace.php keyword.other.namespace.php
source.php source.php.embedded.block.html meta.namespace.php keyword.other.namespace.php
source.php source.php.embedded.block.html meta.namespace.php
source.php source.php.embedded.block.html meta.namespace.php entity.name.type.namespace.php
source.php source.php.embedded.block.html meta.namespace.php entity.name.type.namespace.php
source.php source.php.embedded.block.html meta.namespace.php entity.name.type.namespace.php
source.php source.php.embedded.block.html meta.namespace.php entity.name.type.namespace.php
source.php source.php.embedded.block.html punctuation.terminator.expression.php
source.php source.php.embedded.block.html
source.php source.php.embedded.block.html
source.php source.php.embedded.block.html punctuation.section.embedded.end.php source.php
source.php source.php.embedded.block.html punctuation.section.embedded.end.php
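A minimal sketch of how files in this format could be read, assuming each case starts with --TEST-- and each expectation line has the layout kind:row:col:scope, as in the first example (the second example lists bare scopes and would need a different reader). The function names are hypothetical and are not taken from the linked repository.

```python
def parse_test_file(text):
    """Split a --TEST--/--FILE--/--EXPECT-- document into a list of cases."""
    cases, current, section = [], None, None
    for line in text.splitlines():
        if line == "--TEST--":
            # A new case begins; its description follows on the next lines.
            current = {"TEST": [], "FILE": [], "EXPECT": []}
            cases.append(current)
            section = "TEST"
        elif line in ("--FILE--", "--EXPECT--") and current is not None:
            section = line.strip("-")
        elif current is not None:
            current[section].append(line)
    return cases


def parse_expectation(line):
    """Parse 'kind:row:col:scope' into a tuple (assumed layout)."""
    kind, row, col, scope = line.split(":", 3)
    return kind, int(row), int(col), scope


SAMPLE = """--TEST--
Basic function support
--FILE--
<?php

array_map();
--EXPECT--
match:2:0:source.php support.function
"""

cases = parse_test_file(SAMPLE)
expectations = [parse_expectation(entry) for entry in cases[0]["EXPECT"]]
```

What "equal" and "match" mean is left to the test runner; the sketch only separates the fields.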


0 Likes

#23

Best thing I heard in a long time. Thanks, Jon!

0 Likes

#24

Yay, we’ll finally be able to properly highlight setext headers in Markdown.

0 Likes

#25

Aside from the great news about this feature, I’d like to thank you for giving us insight into the next build timeframe.

0 Likes

#26

Looking forward to this!

0 Likes

#27

Unfortunately not. It can only recognise LL(1) grammars, so the contents of the following lines can’t influence the prior lines.

0 Likes

What class of grammars can the `.sublime-syntax` parser capture?
#28

Docs are up: sublimetext.com/docs/3/syntax.html

0 Likes

#29

Btw, Jon, would it be possible to differentiate between .c and .h (or any body and header type) so that a unique icon can be assigned to each of them? It’s not that important, but it would be neat to provide a visual difference :v

I don’t know if that differentiation is a matter of syntax definition or tmPreferences, but right now, I don’t think it’s possible with the current implementation…

0 Likes

#30

With the current icon system, it’d require header files to have a separate syntax definition (which would presumably just include the regular one), which wouldn’t be unhandy in any case.

0 Likes

#31

Interesting. The push and pop stuff reminds me of how Pygments does its syntax highlighting. I guess it’s not an uncommon approach. Looking forward to playing with it.

0 Likes

#32

Yes, btw: there will be an automatic converter from .tmLanguage to .sublime-syntax files.

0 Likes

Tmlanguage to sublime-syntax convertor
#33

TextMate 2 has a nice feature that I’d like to see in Sublime Text: being able to generate the scope name based on what you’re matching. Here is a cool example from the Markdown syntax in TextMate:

heading = {
  name = 'markup.heading.${1/(#)(#)?(#)?(#)?(#)?(#)?/${6:?6:${5:?5:${4:?4:${3:?3:${2:?2:1}}}}}/}.markdown';
  begin = '(?:^|\G)(#{1,6})\s*(?=[\S&&[^#]])';
  end = '\s*(#{1,6})?$\n?';
  captures = { 1 = { name = 'punctuation.definition.heading.markdown'; }; };
  contentName = 'entity.name.section.markdown';
  patterns = ( { include = '#inline'; } );
}

It adds scope names like markup.heading.2.markdown and markup.heading.3.markdown depending on the heading level of the markdown. I’ve recreated these scopes by duplicating the match definition 6 times.
It would be awesome to have some basic string replacement capabilities in the scope name, based on the match regex.
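Until something like that exists, the duplication can at least be generated. This is only a sketch: the `match`/`scope` keys mirror the new syntax format, but the regex and the helper name `heading_rules` are illustrative assumptions, not the rules actually used in any Markdown package.

```python
def heading_rules(max_level=6):
    """Build one rule per heading level instead of writing six by hand."""
    rules = []
    for level in range(1, max_level + 1):
        rules.append({
            # Exactly `level` hashes, not followed by another hash.
            "match": r"^#{%d}(?!#)" % level,
            "scope": "markup.heading.%d.markdown" % level,
        })
    return rules


rules = heading_rules()
```

The resulting list could be dumped to YAML by whatever tool writes the syntax file.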

1 Like

#34

[quote=“jps”]

Yes, btw: there will be an automatic converter from .tmLanguage to .sublime-syntax files.[/quote]

Excellent. Since the new system looks much nicer to work with, I hope this will stimulate community involvement in the creation of new and bugfix/update of existing language definitions.

To return to an earlier point, I’m wondering just how much room there is for improvement in Sublime’s syntax parser and what benefits such improvements could bring. I referred previously to expanding APIs in this area to facilitate the development of language aware plugins like code-refactoring, but it also struck me that more precise and detailed scopes could allow for the implementation of useful editor functions, for example “goto reference” (inverse of goto definition), in a lightweight manner. Thoughts?

0 Likes

#35

[quote=“jps”]

Unfortunately not. It can only recognise LL(1) grammars, so the contents of the following lines can’t influence the prior lines.[/quote]

Do you know if an LL(1) parser is sufficient for syntax-highlighting C++? I know that C++ can’t be lexed with anything less than a Turing machine, but I don’t know if syntax highlighting is as hard as lexing.

Great work so far! I absolutely love the new syntax file spec. I’ve been terrified about writing syntax files till now but I think I will give it another go once this is done. There are a lot of issues with C++ syntax highlighting that I want to fix. :smile:

0 Likes

#36

First off, awesome!
Secondly, but muh time. :frowning:

This is somewhat true, sadly. Once you understand the rules behind it, it becomes rather simple, and when assisted by a linter (such as sublimelinter-pyyaml) you can probably catch most of the bad unquoted strings easily before you run into issues. It’s not as straightforward as JSON, however.
Anyway, I love YAML (which is why .YAML-tmLanguage exists), so I embrace this change.

By the way, SyntaxHighlightTools goes in a different direction that defines a custom YAML-like syntax but it’s not really suited for the context model imo.
It’s nice that it bundles the tmPreferences stuff however which “works” but is still pretty clunky. It’s afaik the only extension file type that is still only available using ye old textmate-compatible plists besides tmTheme. Any plans for working on that in the future?

That sounds great! I would also like to know how much the usage of atomic groups for the popular keyword matches would improve performance compared to normal grouped matches because I never see them used anyway, besides maybe my own syntaxes.

find_by_selector or extract_scope are really useful in that regard, but sometimes a slightly better syntax crawling could be useful. Something similar to what jQuery does maybe with the parent, child or sibling queries. This should all be doable with the current APIs and some neat algorithms however, except for the weird case that facelessuser described. If I ever needed something more complex I’d likely write some of these algorithms myself but it would be interesting to know how taxing these two API calls are (for large files) and if it makes sense to implement more performant variants in ST itself.

Now to my own feedback regarding the new syntax format, which is a bit on the technical part.

  1. The possibility to match multi-line stuff with barer context control is great.

  2. The possibility to more or less “inject” matches into included contexts or other languages is also great with the prototype stuff.
    Have you thought of including/pushing a specific context of a different syntax definition? Such as push: "Packages/HTML/HTML.sublime-syntax:tag", or using another character that is usually invalid in file paths. That would allow more flexible and organised syntax management and would not require defining multiple hidden syntax definitions.

  3. Would it be feasible to allow some kind of reflexive context-awareness for these with_prototype-injected patterns? E.g. some way to only apply the pattern if “string” matches the entire current scope or only the deepest scope name. Or maybe just make it apply only in certain contexts.

  4. You have an example where a context match accesses the matched groups of the match that popped it with a \1 backref (heredocs). I suppose this only works with the parent context’s match, but this assumes that the parent context has a match in the first place, correct?

What about the normal Oniguruma syntax of \n (where n is any number to back-reference a previous group match)?
I know that this was also possible previously with begin-end-patterns but I never tested how it worked. Are matches in the end pattern simply appended matches of the begin pattern?

What will patterns backref if a context was changed using the set key?

  5. Are named capture groups an option instead of always using numbers? It would sometimes be cleaner to have named capture groups and assign scope names to those groups by name instead of by number. Oniguruma seems to support this according to the spec, but it’s not enabled for syntax definitions, and I don’t know if the options are exclusive.

  6. What happens when both YAML.tmLanguage and YAML.sublime-syntax are present?

  7. The current YAML syntax def is still not exactly correct, but it’s better than the previous one. I’ll wait until you’ve settled the “bringing the syntax definitions up-to-date” work (which hopefully results in some form of open sourcing for contributions) before fixing it, however.

0 Likes

Will there be a .sublime-something replacement for the .tmPreferences file format?
#37

[quote=“FichteFoll”]First off, awesome!

At this stage, no. YAML is a great format if you’re familiar with it, but the syntax can be quite unintuitive at times (e.g., exactly when a string needs to be quoted is not always obvious). I’m not ruling it out, it’s just something I’m approaching with caution.

This is somewhat true, sadly. Once you understand the rules behind it, it becomes rather simple, and when assisted by a linter (such as sublimelinter-pyyaml) you can probably catch most of the bad unquoted strings easily before you run into issues. It’s not as straightforward as JSON, however.
Anyway, I love YAML (which is why .YAML-tmLanguage exists), so I embrace this change.
[/quote]

Now that I’ve spent more time working with YAML, I’m still not sure how I feel about it. I don’t like the way it effectively forces 2-space indenting, and I don’t like that there’s just so much stuff in the spec, but once you know what’s going on it is nice to write. It’s easy enough to justify for things like syntax definitions, but not so much for everything else. Who would ever guess that double quotes and single quotes have different escaping rules, for example?

It’s not mentioned in the documentation, but you can use contexts in other .sublime-syntax files in the same way as local ones (e.g., push them or include them). They can be referenced via “Packages/Foo/Foo.sublime-syntax”, “Packages/Foo/Foo.sublime-syntax#main” (#main is implied if not present), or “Packages/Foo/Foo.sublime-syntax#strings”.
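To illustrate the three reference forms, here is a small sketch that splits such a reference into a file path and a context name, falling back to main when no context is given. The helper `parse_context_reference` is hypothetical; it only models the naming rule described above, not how Sublime Text resolves the reference internally.

```python
def parse_context_reference(ref):
    """Split 'path#context' into (path, context); '#main' is implied."""
    path, _, context = ref.partition("#")
    return path, (context or "main")


implicit = parse_context_reference("Packages/Foo/Foo.sublime-syntax")
explicit = parse_context_reference("Packages/Foo/Foo.sublime-syntax#main")
strings = parse_context_reference("Packages/Foo/Foo.sublime-syntax#strings")
```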

Currently the named context, and the transitive closure of the contexts it references are cloned and rewritten with the patterns mentioned in with_prototype. I’m open to extending this so that some contexts won’t be rewritten, but a use case would be nice :slight_smile:

[quote=“FichteFoll”]You have an example where a context match accesses the matched groups of the match that popped it with a \1 backref (heredocs). I suppose this only works with the parent context’s match, but this assumes that the parent context has a match in the first place, correct?

What about the normal Oniguruma syntax of \n (where n is any number to back-reference a previous group match)?
I know that this was also possible previously with begin-end-patterns but I never tested how it worked. Are matches in the end pattern simply appended matches of the begin pattern?

What will patterns backref if a context was changed using the set key?
[/quote]

If the pop pattern (this processing is only done to the first, btw) has any backrefs, then when the context is pushed onto the stack, the pop pattern will be rewritten with the values that the push or set captured when the context was pushed onto the stack. If the regex used to push the context doesn’t define a corresponding capture, then the empty string is substituted.

This also has the implication that pop regexes are unable to use backrefs in the normal way, as they’ll be rewritten before the regex engine has a chance to see them. You can work around this with named captures though, which are ignored by the pop regex rewriting logic.
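The rewriting described in the last two paragraphs can be sketched roughly as follows. This uses Python’s `re` module as a stand-in for the real regex engine, and it assumes the captured text is escaped before substitution, which the description above does not state. The function name `rewrite_pop_pattern` is made up for illustration, and the sketch ignores corner cases such as an escaped backslash in front of a digit.

```python
import re

# Matches a numeric backreference such as \1 (single digit, simplified).
_NUMERIC_BACKREF = re.compile(r"\\(\d)")


def rewrite_pop_pattern(pop_pattern, push_match):
    """Replace \\N in pop_pattern with the text captured by push_match."""
    def substitute(backref):
        index = int(backref.group(1))
        if index > push_match.re.groups:
            return ""  # no corresponding capture: empty string
        captured = push_match.group(index) or ""
        return re.escape(captured)  # assumption: captures are inserted literally
    return _NUMERIC_BACKREF.sub(substitute, pop_pattern)


# A heredoc-like opener captures its terminator word.
push_match = re.match(r"<<<(\w+)", "<<<EOT")
rewritten = rewrite_pop_pattern(r"^\1;", push_match)
missing = rewrite_pop_pattern(r"^\2$", push_match)
# Named backreferences contain no \digit, so they pass through untouched.
named = rewrite_pop_pattern(r"(?<q>x)\k<q>", push_match)
```

Because only numeric references are rewritten, the named form reaches the regex engine unchanged, which is the workaround mentioned above.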

With regards to set vs push, they operate in exactly the same way, set just pops the current context and then pushes the given context(s) on the stack.
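As a toy model of that statement (plain lists of context names, nothing Sublime-specific, all names hypothetical):

```python
def push(stack, contexts):
    """Return a new stack with the given contexts pushed on top."""
    return stack + list(contexts)


def pop(stack):
    """Return a new stack without its top context."""
    return stack[:-1]


def set_contexts(stack, contexts):
    """'set' modelled as a pop followed by a push."""
    return push(pop(stack), contexts)


stack = ["main", "string"]
after_push = push(stack, ["escape"])
after_set = set_contexts(stack, ["comment"])
```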

In principle they could be, I just haven’t done the work to enable them.

No special handling, they’ll just appear in the menu as two separate syntax definitions. My current recommendation is to leave both, but mark the tmLanguage as hidden.

That doesn’t surprise me, YAML is not easy to get right!

1 Like

[sublime-syntax] Allow \1 in patterns that don't pop
#38

Just a heads up, there may be some changes to the regex flavour used by sublime-syntax files in the next build, so you may not want to spend too much time playing with them in the current form.

0 Likes

#39

I just discovered the conversion code and inspected a bit. (Initially because it didn’t work for me.)

  1. The reason why I looked into it in the first place: It doesn’t work for me.

Traceback (most recent call last):
  File "C:\Program Files\Sublime Text 3\sublime_plugin.py", line 535, in run_
    return self.run()
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 415, in run
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 344, in convert
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 205, in make_context
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 264, in make_context
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 127, in format_external_syntax
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 113, in syntax_for_scope
  File "convert_syntax in C:\Program Files\Sublime Text 3\Packages\Default.sublime-package", line 103, in build_scope_map
  File "./plistlib.py", line 104, in readPlistFromBytes
    raise AttributeError(attr)
  File "./plistlib.py", line 76, in readPlist
    #
  File "./plistlib.py", line 378, in parse
    raise ValueError("unexpected key at line %d" %
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 233, column 24

So, it appears that I have an invalid plist file in my packages somewhere. This needs to be caught and the conversion must continue. Furthermore, it would be nice if the name of the badly formatted file was displayed.
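Something along these lines is what I have in mind. It is only a sketch using the modern `plistlib.loads`; the function borrows the name `build_scope_map` from the traceback, but its signature and body are my own guess and not the shipped code. It takes a mapping of file names to raw bytes so that it can be tried without touching the disk.

```python
import plistlib
from xml.parsers.expat import ExpatError


def build_scope_map(plists):
    """Map scopeName -> file name, skipping files that fail to parse.

    `plists` maps a file name to the raw bytes of that file (assumed input).
    """
    scope_map = {}
    skipped = []
    for name, data in plists.items():
        try:
            parsed = plistlib.loads(data)
        except (ExpatError, plistlib.InvalidFileException, ValueError) as exc:
            # Remember which file was broken instead of aborting the conversion.
            skipped.append((name, str(exc)))
            continue
        scope = parsed.get("scopeName") if isinstance(parsed, dict) else None
        if scope:
            scope_map[scope] = name
    return scope_map, skipped


GOOD = (b'<?xml version="1.0" encoding="UTF-8"?>\n<plist version="1.0"><dict>'
        b'<key>scopeName</key><string>source.good</string></dict></plist>')
# A bare ampersand makes the XML not well-formed.
BAD = (b'<?xml version="1.0" encoding="UTF-8"?>\n<plist version="1.0"><dict>'
       b'<key>scopeName</key><string>a & b</string></dict></plist>')

scope_map, skipped = build_scope_map({"Good.tmLanguage": GOOD,
                                      "Bad.tmLanguage": BAD})
```

The `skipped` list carries the file name next to the parser message, so the offending file can be shown to the user.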

The format_regex function strips all leading whitespace of regular expressions and destroys indentation. I suggest using a function that detects the base indentation of a string (while overseeing the first line since that will most likely start with (?x)) and then strip only that indent instead of everything. Something like

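import textwrap

def format_regex(s):
    if "\n" in s:
        lines = s.splitlines(True)
        # Keep the first line (usually "(?x)") and strip only the common indent of the rest.
        s = lines[0] + textwrap.dedent(''.join(lines[1:]))
        s = s.rstrip("\n") + "\n"
    return s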

You use plistlib for parsing. I’ve had problems with that before, notably that import xml.parsers.expat, which plistlib uses, failed on some Linux distros. This was all on ST2 though, but you might still want to look into it. I also had problems with plistlib sometimes reporting an “unknown encoding” for files with <?xml version="1.0" encoding="UTF-8"?>. I have no idea why that would have happened since that’s what it emits with plistlib.writePlist too.

ConvertSyntaxCommand is a WindowCommand, despite only really operating on a view. Is there a certain reason for this? If it’s because TextCommands automatically come with an edit parameter (and create an undo point, though I’m not sure about that) then I propose to add a new “ViewCommand” that has the view object in self.view but does not automatically create an edit. The reason is that window.active_view() is not “portable” and doesn’t work as expected if run with some_view.run_command("") and some_view is inactive.

  1. import yaml should really be part of the if __name__ == "__main__": section

  2. Edit: I discovered the $base include a while ago but never found out what it actually does. Considering it’s translated to $top_level_main I assume that this is only relevant if the syntax def this include is in has been included from externally?

By the way, I’m willing to work on an “official” Python.sublime-syntax file. I have a lot of experience with syntax definitions in general and have hacked on the PythonImproved syntax to a decent extent. Because of copyright issues I’d start from scratch, which would also result in a clean structure, I suppose.

0 Likes

#40

Excellent work Jon, already much better than working with the tmLanguage files.

Would it be possible to ignore spaces in the regular expressions? This would help to break up regular expressions into groups to make reading and editing them easier. Currently they are just big regexes on one line. If I have missed a way to break them up visually please let me know.

0 Likes

#41

@315234 (awesome name): Use the (?x) switch. It will ignore all spaces after it, except for those within character sets. geocities.jp/kosako3/oniguruma/doc/RE.txt
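A quick illustration of the idea, using Python’s `re` module as a stand-in since it treats (?x) in the same spirit; the exact rules of the engine used by sublime-syntax files may differ in details. The pattern itself is just an example.

```python
import re

# With (?x), whitespace and #-comments inside the pattern are ignored,
# so the regex can be split into commented groups.
HEX_COLOR = re.compile(r"""(?x)
    \#                          # a literal hash, escaped so it is not a comment
    (?: [0-9a-fA-F]{3} ){1,2}   # three or six hex digits
    \b
""")

# Whitespace inside a character set still counts, so [ ] matches a space.
SPACED = re.compile(r"(?x) a [ ] b")

six = HEX_COLOR.search("color: #ff00aa;").group()
three = HEX_COLOR.search("color: #fff;").group()
```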

0 Likes