Sublime Forum

Regex modifiers in look behind positive fail

#1

I am trying to capture the parameters of:
function varName(x, y, z)
Where varName can be any variable name and it shouldn’t capture cases that aren’t functions

- match: (?<=function [\w]*\()[a-zA-Z., ]*(?=\))
This doesn’t work because it won’t accept the * quantifier inside the positive lookbehind. What am I supposed to use instead? Many posts suggest \K as the solution, but that doesn’t seem to register at all.

How are you supposed to NOT capture things in general?
i.e. hello
how do I highlight just hllo?
- match: he{0}llo
will highlight hello and hllo in sublime, but it’s only supposed to capture hllo


#2

Lookbehinds should generally be avoided, as they trigger the old Oniguruma regex engine, which is slower than ST’s default sregex engine.

TL;DR: Let’s call lookbehinds deprecated.
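The restriction from the original question is not unique to ST. As a rough illustration only, here is a minimal Python sketch (Python’s re module is a different engine from ST’s, and the helper names are made up for this example). It shows a variable-width lookbehind being rejected and a plain capture group picking out the parameters instead:

```python
import re

# The pattern from the question: \w* inside the lookbehind has no fixed width.
PATTERN_LOOKBEHIND = r"(?<=function \w*\()[a-zA-Z., ]*(?=\))"
# Alternative: consume the prefix normally and capture only the parameters.
PATTERN_CAPTURE = r"function \w*\(([a-zA-Z., ]*)\)"


def lookbehind_compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def extract_parameters(line):
    """Return the parameter list of a function definition, or None."""
    match = re.search(PATTERN_CAPTURE, line)
    return match.group(1) if match else None
```

In a syntax definition the same idea is expressed either with captures or, preferably, with pushed contexts.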

The way to go is to push proper contexts onto the stack when matching/consuming certain patterns.

A common way function calls are handled in many syntax definitions looks as follows.

%YAML 1.2
---
name: C
scope: source.c
version: 2

contexts:
  main:
    - include: function-calls

  function-calls:
    - match: \w+(?=\s*\()
      scope: meta.function-call.identifier.c variable.function.c
      push: function-call-argument-list

  function-call-argument-list:
    - meta_content_scope: meta.function-call.c
    - match: \(
      scope: punctuation.section.arguments.begin.c
      set: function-call-argument-list-body

  function-call-argument-list-body:
    - meta_scope: meta.function-call.arguments.c
    - match: \)
      scope: punctuation.section.arguments.end.c
      pop: 1
    # ...


#3

Can you or someone explain how this works starting at main?

  1. include is really just calling the ‘function-calls’ function in main? (can I not just write it in main? how?)
  2. this 1st function matches the word before a (
  3. scope assigns a color to this word
  4. the word is ‘pushed onto the stack’? what is the stack? regardless It seems to call the next function…
  5. This second function…what is meta_content_scope even included for? this carries over the last match? in what way? u can delete the line and nothing changes. the next meta thing will carry the parameter data over this one tho I don’t see what it’s doing.
  6. We match the first ( now
  7. we give it a color via scope
  8. set is the same as push but “will first pop this context off, and then push the given context(s) onto the stack.”…huh? it removed the context and then pushes the context…why change it from push?
  9. final function, this meta as mentioned b4 seems to grab whatever is between the previous and next match aka the parameters
  10. we match the )
  11. we assign ) a color
  12. we pop: 1 …this pops content off the stack…why is this necessary and what does that mean? Why is there only 1 of these in a syntax file, at the very last line? Can you get two to work?

So for my purposes I am trying to get the generic Lua syntax file’s highlighting of parameters to work in the Corona SDK syntax file. From your code I now have the following, which colours the parameter data via Lua scopes (it’s changed to omit colouring of the brackets themselves).

  main: # working, assuming header info is present
    - include: function-calls

  function-calls:
    - match: function \w+(?=\s*\() 
      push: function-call-argument-list 

  function-call-argument-list:
    - match: (?<=\()\w 
      set: function-call-argument-list-body 

  function-call-argument-list-body:
    - meta_scope: variable.parameter.function.lua
    - match: \w(?=\)) 
      scope: variable.parameter.function.lua 
      pop: 1

I want to merge this functionality of coloured parameters into an existing syntax file:


But whenever I try to plop it into the file ‘somewhere’, the pop in mine and the pop in theirs seem to conflict, as if there can be only one pop in the entire file. There is also going to be redundancy if I tack one system onto the other. How would I merge these to get coloured parameters into the existing code? I’d be curious about both a hack-job approach (plop it in there, redundancies are fine) and an elegant one that fixes the existing rules to grab the parameters. I can maybe figure a lot of that out myself if anyone can help with any of the above. Yes, I am trying to read https://www.sublimetext.com/docs/syntax.html, but it makes very little sense to me, as it’s written as if I already know how this style of program functions. For example, why can’t I just add function-call anywhere on its own? Isn’t it just a function? I get an error when adding any function anywhere. And I don’t understand at all how captures: works.


#4

Writing syntax definitions is a complex task and takes quite some patience to get into.

I hope the following notes help with that a little, even though I must confess I am probably too deep into the topic to find easy-to-understand words.

  1. include is really just calling the ‘function-calls’ function in main?

A context is a collection of regular expressions being used to find tokens (e.g. keywords, punctuation, operators, etc.) in an open file’s content.

main is the top-level context the syntax engine starts with.

You could of course try to write all required patterns into main, one after another. That’s what early TextMate syntax definitions actually did. But you already found the resulting limitations :wink:

To modularize a syntax definition, we can create our own named contexts with logically grouped collections of patterns (like those to find function calls).

The keyword include merges all patterns from such a user-defined context into the context in which include appears.

In the following example we just say: Take all patterns from function-calls and add them to main.

  main:
    - include: function-calls

  function-calls:
    - match: function \w+(?=\s*\() 
      ...

Internally, ST resolves include to something like the following:

  main:
    - match: function \w+(?=\s*\() 
      ...

You’ll see the value as soon as you need to re-use patterns in different situations, which will become obvious as soon as we start using push or set.
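To make the idea of include concrete, here is a small Python sketch of that resolution step. It is only a model with made-up data structures (plain dicts and lists), not how ST actually stores contexts, and it does not guard against contexts that include each other in a cycle:

```python
def resolve_includes(contexts, name):
    """Return the patterns of context `name` with every include inlined."""
    resolved = []
    for rule in contexts[name]:
        if "include" in rule:
            # Splice the included context's patterns in at this position.
            resolved.extend(resolve_includes(contexts, rule["include"]))
        else:
            resolved.append(rule)
    return resolved


# Hypothetical in-memory version of the two contexts shown above.
contexts = {
    "main": [{"include": "function-calls"}],
    "function-calls": [
        {"match": r"function \w+(?=\s*\()", "push": "function-call-argument-list"},
    ],
}

main_patterns = resolve_includes(contexts, "main")
```

After resolution, main_patterns holds exactly the patterns written in function-calls, which is why the same context can be included from several places without copying it.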

  4. the word is ‘pushed onto the stack’? what is the stack? regardless It seems to call the next function…

Imagine the stack as a list or array, which contains (pointers to) contexts (aka. collections of patterns to find tokens).

The syntax engine always uses the top-most list item’s context to look for tokens.

At the beginning, this list contains only a single item: main.

The following snippet adds the context function-call-argument-list to that list of contexts as soon as the regular expression \w+(?=\s*\() matches the content at the current search position. After that, the syntax engine uses the patterns from function-call-argument-list instead of those from main. In our case, it looks only for (.

    - match: \w+(?=\s*\()
      scope: meta.function-call.identifier.c variable.function.c
      push: function-call-argument-list

Now the stack contains:

  • function-call-argument-list <- active context
  • main

If it finds ( in the content, the following pattern sets function-call-argument-list-body onto the stack, which means it replaces the currently active context without adding another one to the stack.

    - match: \(
      scope: punctuation.section.arguments.begin.c
      set: function-call-argument-list-body

Now the stack contains:

  • function-call-argument-list-body <- active context
  • main

The active context function-call-argument-list-body currently contains a single regular expression to look for ). If a ) appears in the text, pop: 1 is executed, which removes the active context from our stack.

  function-call-argument-list-body:
    - meta_scope: meta.function-call.arguments.c
    - match: \)
      scope: punctuation.section.arguments.end.c
      pop: 1

Now the stack contains:

  • main

We are back where we began, using the initial set of regular expressions.
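If it helps, the three steps above can be replayed with a plain Python list standing in for the context stack. This is only a mental model (the helper names push, set_ and pop are made up here), not ST’s real implementation:

```python
def push(stack, context):
    """Add a context on top of the stack."""
    return stack + [context]


def set_(stack, context):
    """Replace the active (top-most) context with another one."""
    return stack[:-1] + [context]


def pop(stack, count=1):
    """Remove `count` contexts from the top of the stack."""
    return stack[:len(stack) - count]


stack = ["main"]
after_push = push(stack, "function-call-argument-list")        # identifier matched
after_set = set_(after_push, "function-call-argument-list-body")  # "(" matched
after_pop = pop(after_set, 1)                                   # ")" matched
```

The last list item is always the active context, so after the final pop only main is left.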

  2. this 1st function matches the word before a (

Yes, it does.

  3. scope assigns a color to this word

Technically, a syntax definition assigns scopes (names) to tokens (words, keywords etc.).

Color schemes use selectors (expressions of scopes) to identify tokens and assign them a color.

The scope is used for various other things, like folding/goto definition/symbol lists/… as well.
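As a very simplified picture of how a color scheme picks a color from scopes, here is a Python sketch. The rule table and colors are invented for illustration, and real selectors support more (operators and specificity ranking) than this prefix check:

```python
def selector_matches(selector, scope):
    """True if `selector` is a dotted prefix of `scope`."""
    selector_parts = selector.split(".")
    scope_parts = scope.split(".")
    return scope_parts[:len(selector_parts)] == selector_parts


# Hypothetical color scheme rules: selector -> color.
COLOR_RULES = {
    "variable.function": "#66D9EF",
    "punctuation.section": "#F8F8F2",
}


def color_for(scopes):
    """Return the color for a token carrying the space-separated `scopes`."""
    # Check the most recently applied (right-most) scope first.
    for scope in reversed(scopes.split()):
        for selector, color in COLOR_RULES.items():
            if selector_matches(selector, scope):
                return color
    return None
```

So the syntax definition only names things; whether variable.function.c ends up blue or green is decided entirely by the color scheme.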

  5. This second function…what is meta_content_scope even included for? this carries over the last match? in what way? u can delete the line and nothing changes. the next meta thing will carry the parameter data over this one tho I don’t see what it’s doing.

meta_scope and meta_content_scope specify scope names that are applied to all text matched while the context is on the stack, until it is removed. The difference: meta_scope also covers the text matched by the patterns that push and pop the context, while meta_content_scope does not cover the text that triggered the context.

In my example it causes meta.function-call.arguments.c to be applied to everything within ( and ), without needing to write special regular expressions.

Meta scopes are used to structure the highlighted code and provide information plugins or autocompletion can use to decide what to suggest, etc.

  8. set is the same as push but “will first pop this context off, and then push the given context(s) onto the stack.”…huh? it removed the context and then pushes the context…why change it from push?

Given the above 3 examples, if we used push instead of set in the second one, we’d end up with 3 contexts on the stack.

Now the stack contains:

  • function-call-argument-list-body <- active context
  • function-call-argument-list
  • main

Calling pop: 1 in the 3rd example would remove one context, and the syntax engine would continue with function-call-argument-list.

Now the stack contains:

  • function-call-argument-list
  • main

… but we want to go back to main immediately after ) was found by function-call-argument-list-body. That’s why we don’t want function-call-argument-list to remain on the stack.

In ST4 you could alternatively use push and pop: 2 to achieve the same:

  main:
    - match: \w+(?=\s*\()
      scope: meta.function-call.identifier.c variable.function.c
      push: function-call-argument-list
  function-call-argument-list:
    - match: \(
      scope: punctuation.section.arguments.begin.c
      push: function-call-argument-list-body   # push
  function-call-argument-list-body:
    - meta_scope: meta.function-call.arguments.c
    - match: \)
      scope: punctuation.section.arguments.end.c
      pop: 2  # requires removing 2 contexts from stack.

What to use absolutely depends on what you are trying to achieve.

However, this alternative has turned out to be a possible source of increased complexity and reduced re-usability of contexts, but it depends.


#5

I still don’t understand a lot, but I think I got everything I need working. I was able to fix the Corona SDK file. I really appreciate the help. Here are my leftover ramblings, if anyone wants to correct me on anything or follow along:

That answered my other question (now deleted), thank you. I had expanded your example to two separate match cases, but it would only do one or the other, not both. I had both at ‘pop: 1’ instead of the fixed working version below. But how do I remember how many things I have on the stack? Can I overpop? If I make it pop: 20 and pop: 30, it seems to work the same. Given that I am still using main (to match “cat”) after overpopping, that must mean you can’t pop main, right? So can I just overpop if I don’t know? And is it just about popping as many times as you push (is pushing the only way the stack grows)? So I do indeed just need pop: 1 and pop: 2 in the below. And if you push a dozen times in one slew of operations and ‘set’ one time, you just need to ‘pop: 1’, I guess.

name: Lua test
comment: "Lua Syntax for Solar2D: version 0.2 (after Lua Syntax: version 0.8)"
file_extensions:
  - lua
scope: source.lua.corona
contexts:
  main:
    - include: function-comments-1 #colourize block comments
    - include: function-parameters-1 #colourize function name and para
    - match: "cat" #colourize cat
      scope: constant.numeric.lua
  function-comments-1:
    - match: '--\[(=*)\['
      scope: constant.language.lua
      push: function-comments-2
  function-comments-2:
    - meta_scope: keyword.operator.lua
    - match: ']]'
      scope: support.function.lua
      pop: 1

  function-parameters-1:
    - match: \w+(?=\s*\() 
      scope: variable.parameter.function.lua
      push: function-parameters-2
  function-parameters-2:
    - match: \(
      scope: support.function.lua
      push: function-parameters-3
  function-parameters-3:
    - meta_scope: variable.parameter.function.lua
    - match: \)
      scope: support.function.lua
      pop: 2

*I’m rewriting the rest of this post since I seem to have figured out a bunch of things

What does capture do? My current understanding is it matches each match’s capture groups to specific scope? So I can colour each specific part of a regex.

 - match: "^\\s*(#)\\s*\\b(include)\\b"
   captures:
    1: meta.preprocessor.c++
    2: keyword.control.include.c++

(This is from the docs: https://www.sublimetext.com/docs/syntax.html.) So is meta.preprocessor.c++ a scope that colours the #, and keyword.control.include.c++ a scope that colours the word include?

In the Corona SDK syntax file it seems to be this series of captures that controls the ability to colourize the parameters, but it seems to be broken or incomplete. The match they use bypasses the problem from my original post because they match the entire thing with all its parts. They don’t try to ignore any components via lookahead or lookbehind like (?<=), so they don’t have to worry about quantifiers not working in them. This is done via regex capture groups (), to which ‘captures:’ then applies specific scopes. You can also apply a general scope to everything if desired, like you see below, so you can stack scopes.

- match: '(\w+)?(?:\s*=\s*)*(function\s*)(\w+\s*[:.]+)*(?:\s+)*(\w+)*\s*(\()[^)]*\)'
  comment: 'Find the various kinds of function definition in Lua: for any line containing "function", match optional "local" and/or "function", match identifier with optional classname, discard optional "=", match optional "function", match "(", match everything until closing ")", match ")"'
  scope: meta.function.lua #variable.parameter.function.lua will italic function, and orange brackets, and parameters...but only want parameters italic and orange
  captures:
    1: entity.name.function.scope.lua
    2: keyword.control.lua  #colours function keyword
    3: entity.name.function.scope.lua 
    4: entity.name.function.scope.lua  #colours the function name
    5: punctuation.definition.parameters.begin.lua #first bracket
    6: variable #broken?
    7: punctuation.definition.parameters.end.lua #last bracket seems broken?

Okay, I got it figured out. It does match the capture groups; it’s just that groups 6 and 7 literally don’t exist in the pattern, so those entries don’t do anything. In other words, it’s bad code as far as I can tell.

- match: '(\w+)?(?:\s*=\s*)*(function\s*)(\w+\s*[:.]+)*(?:\s+)*(\w+)*\s*(\()([^)]*)(\))'
  comment: 'Find the various kinds of function definition in Lua: for any line containing "function", match optional "local" and/or "function", match identifier with optional classname, discard optional "=", match optional "function", match "(", match everything until closing ")", match ")"'
  scope: meta.function.lua #variable.parameter.function.lua will italic function, and orange brackets, and parameters...but only want parameters italic and orange
  captures:
    1: entity.name.function.scope.lua
    2: keyword.control.lua  #colours function keyword
    3: entity.name.function.scope.lua 
    4: entity.name.function.scope.lua  #colours the function name
    # 5: variable.parameter.function.lua #punctuation.definition.parameters.begin.lua #first bracket
    # 6: variable #not doing anything
    6: variable.parameter.function.lua #the parameter list between the brackets
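One quick way to double-check how many capture groups a pattern really has is to let a regex engine count them. The following is a sketch using Python’s re module, which is not the engine ST uses but numbers groups the same left-to-right way; the sample line "function foo(a, b)" is made up for the test:

```python
import re

# The pattern as found in the Corona SDK syntax file.
ORIGINAL = r"(\w+)?(?:\s*=\s*)*(function\s*)(\w+\s*[:.]+)*(?:\s+)*(\w+)*\s*(\()[^)]*\)"
# The same pattern with the parameter list and closing bracket wrapped in groups.
FIXED = r"(\w+)?(?:\s*=\s*)*(function\s*)(\w+\s*[:.]+)*(?:\s+)*(\w+)*\s*(\()([^)]*)(\))"

# Number of capturing groups; (?:...) groups are not counted.
original_groups = re.compile(ORIGINAL).groups
fixed_groups = re.compile(FIXED).groups

# Group 6 of the fixed pattern holds the text between the brackets.
match = re.search(FIXED, "function foo(a, b)")
parameters = match.group(6)
```

The original pattern only has five capturing groups, so entries 6 and 7 under captures had nothing to refer to.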

#6

…how do I remember how many things I have on the stack?

Well, by understanding and thinking like ST’s lexer - the part of the syntax engine which parses the text and creates the tokens using the patterns from a syntax definition.

I personally build mental routes in my head to track what happens next when applying a push/set or pop operation.

It is probably too much to explain here, but basically most of it is about the strategy for designing the structure of a syntax definition. You can do all sorts of weird things, ending in uncontrollable spaghetti. That’s what early sublime-syntax definitions looked like.

Finally however - once you’ve written some for various different languages :sweat_smile: - you’ll recognize some sort of standard scheme most of them follow.

Keep it as simple as possible to be able to follow the tracks. That’s one reason why I’d suggest sticking to push, set and pop: 1 whenever possible. In rare cases pop: 2 might help, but anything else can and should be avoided, as it increases complexity and easily makes you lose track of what’s happening.

Given that I am still using main (to match “cat”) after overpopping, that must mean u can’t pop main right? So can I just overpop if I don’t know?

Yes, ST has a safeguard to prevent popping main.
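In terms of the list model of the stack, that safeguard behaves roughly like the following Python sketch. This only mirrors the behaviour described in this thread; how ST implements it internally is an assumption here:

```python
def pop(stack, count):
    """Remove up to `count` contexts, but always keep the bottom one (main)."""
    keep = max(1, len(stack) - count)
    return stack[:keep]


normal = pop(["main", "function-parameters-2", "function-parameters-3"], 2)
overpopped = pop(["main", "function-comments-2"], 20)
```

Both calls end with just main on the stack, which is why pop: 20 appears to work. It still hides how deep the stack really is, so it is not something to rely on.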

… push a dozen times in one slew of operations and ‘set’ one time u just need to ‘pop: 1’ I guess.

Not sure I understand that correctly.

In the following example you’d add foo-body and bar-body to the stack. Using set: baz-body replaces bar-body, so it still requires 2 contexts to be removed to return to main immediately.

main:
  - match: foo
    push: foo-body   # adds foo-body to stack

foo-body:
  - match: bar
    push: bar-body   # adds bar-body to stack
  - match: (?=\S)
    pop: 1

bar-body:
  - match: baz
    set: baz-body    # replaces bar-body by baz-body
  - match: (?=\S)
    pop: 2
    pop: 2

baz-body:
  - match: ;
    pop: 2

You can replace set by push and increase the pop count to get the same result.

main:
  - match: foo
    push: foo-body   # adds foo-body to stack

foo-body:
  - match: bar
    push: bar-body   # adds bar-body to stack
  - match: (?=\S)
    pop: 1

bar-body:
  - match: baz
    push: baz-body   # adds baz-body to stack
  - match: (?=\S)
    pop: 2   # from here, 2 contexts need to be popped to return to main immediately

baz-body:
  - match: ;
    pop: 3   # from here, 3 contexts need to be popped to return to main immediately

But generally speaking, this is a design I’d try to avoid, as juggling different pop: n values makes the syntax structure complicated and decreases the re-usability of contexts.

A common scheme would be to use push...set...set...set...set...pop: 1.

main:
  - match: foo
    push: foo-body   # adds foo-body to stack

foo-body:
  - match: bar
    set: bar-body    # replace foo-body by bar-body
  - match: (?=\S)
    pop: 1

bar-body:
  - match: baz
    set: baz-body    # replaces bar-body by baz-body
  - match: (?=\S)
    pop: 1
    pop: 1

baz-body:
  - match: ;
    pop: 1
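To watch the push...set...set...pop: 1 scheme run, here is a toy Python interpreter for the four contexts above. It is a deliberately naive sketch (rules tried in order at the current position, one character skipped when nothing matches) and makes no claim to match ST’s real lexer:

```python
import re

# Each context is an ordered list of (regex, action, argument) rules.
CONTEXTS = {
    "main":     [(r"foo", "push", "foo-body")],
    "foo-body": [(r"bar", "set", "bar-body"), (r"(?=\S)", "pop", 1)],
    "bar-body": [(r"baz", "set", "baz-body"), (r"(?=\S)", "pop", 1)],
    "baz-body": [(r";", "pop", 1)],
}


def run(text):
    """Return the stack after every push/set/pop triggered by `text`."""
    stack, pos, trace = ["main"], 0, []
    while pos < len(text):
        # Only the rules of the top-most context are considered.
        for pattern, action, argument in CONTEXTS[stack[-1]]:
            match = re.compile(pattern).match(text, pos)
            if match is None:
                continue
            if action == "push":
                stack = stack + [argument]
            elif action == "set":
                stack = stack[:-1] + [argument]
            else:  # pop, but never remove main
                stack = stack[:max(1, len(stack) - argument)]
            trace.append(list(stack))
            pos = match.end()
            break
        else:
            pos += 1  # no rule matched here, skip one character
    return trace
```

Running run("foo bar baz ;") shows the stack never growing beyond two entries, which is what makes this scheme easy to follow.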

Modern syntaxes also push multiple contexts onto the stack and pop them one after another to achieve the same, but with maximum re-usability of the contained contexts.

main:
  - match: foo
    push: 
      - baz-body    # continue here if `bar-body` was popped
      - bar-body    # continue here if `foo-body` was popped
      - foo-body    # first one being evaluated

foo-body:
  - match: bar
    pop: 1
  - match: (?=\S)
    pop: 1

bar-body:
  - match: baz
    pop: 1
  - match: (?=\S)
    pop: 1
    pop: 1

baz-body:
  - match: ;
    pop: 1
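The order in that push list can be confusing at first, so here is the same thing in the Python list model (again only a sketch with a made-up helper name): the last item of the list lands on top of the stack and is therefore evaluated first.

```python
def push_many(stack, contexts):
    """Push a list of contexts; the last list item ends up on top."""
    return stack + list(contexts)


stack = push_many(["main"], ["baz-body", "bar-body", "foo-body"])

order = []
while len(stack) > 1:
    order.append(stack[-1])  # the active context is the top-most one
    stack = stack[:-1]       # every context finishes with pop: 1
```

Each context only ever needs pop: 1 and knows nothing about its neighbours, which is where the re-usability comes from.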

Which strategy is the best one depends on how detailed meta scopes are to be applied, but that’s another lesson.

What does capture do? My current understanding is it matches each match’s capture groups to specific scope? So I can colour each specific part of a regex.

Correct. Matches of parenthesized groups (<pattern>) are added to a list of results. Each item can be assigned a separate scope using the captures keyword followed by the index of the group.

Note, however: if a pattern contains foo (bar)* baz and the text contains foo bar bar bar baz, only one bar (typically the last repetition in most regex engines) would be addressed. The others are not assigned a scope.
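The effect of a repeated group can be tried out in any regex engine. Below is a sketch using Python’s re module, with the space moved inside the group so that the pattern actually matches the sample text. Python keeps the last repetition; ST’s engines are separate implementations, so treat this as an illustration of the general limitation rather than a statement about ST:

```python
import re

# A repeated group has only one slot, no matter how often it repeats.
match = re.match(r"foo( bar)* baz", "foo bar bar bar baz")

captured_text = match.group(1)   # text of the single remembered repetition
captured_span = match.span(1)    # where that repetition sits in the input
```

Either way, a single capture index can only point at one region, so repeated items are better handled by a pushed context with its own pattern.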

Using your example, # is scoped meta.preprocessor and include as keyword.control.include.

 - match: "^\\s*(#)\\s*\\b(include)\\b"
   captures:
    1: meta.preprocessor.c++
    2: keyword.control.include.c++

Note: scope names are odd here. # is a punctuation and meta.preprocessor should be applied to the whole line. So the following would make more sense:

 - match: "^\\s*(#)\\s*\\b(include)\\b"
   scope: meta.preprocessor.c++
   captures:
    1: punctuation.definition.preprocessor.c++
    2: keyword.control.include.c++

or even the following, if we want to skip leading whitespace.

 - match: "^\\s*((#)\\s*\\b(include)\\b)"
   captures:
    1: meta.preprocessor.c++
    2: punctuation.definition.preprocessor.c++
    3: keyword.control.include.c++
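To see which part of a line each capture index refers to, here is a Python sketch of that last pattern. The scoped_regions helper and the sample line are made up for illustration, and Python’s re stands in for ST’s engine:

```python
import re

PATTERN = re.compile(r"^\s*((#)\s*\b(include)\b)")

# Capture index -> scope, as in the captures mapping above.
CAPTURES = {
    1: "meta.preprocessor.c++",
    2: "punctuation.definition.preprocessor.c++",
    3: "keyword.control.include.c++",
}


def scoped_regions(line):
    """Return (span, text, scope) for every capture group that matched."""
    match = PATTERN.match(line)
    if match is None:
        return []
    return [
        (match.span(index), match.group(index), scope)
        for index, scope in CAPTURES.items()
        if match.group(index) is not None
    ]


regions = scoped_regions("  #include <stdio.h>")
```

Group 1 wraps groups 2 and 3, so # and include carry the meta scope plus their own scope, while the leading whitespace gets none.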

On the corona sdk syntax file it seems to be this series of captures that controls the ability to colourize the parameters, but it seems to be broken or incomplete.

Looks very much like a relic of the old TextMate age, when pushing contexts was not possible. Back in the day, syntax definitions used very complex regular expressions to try to find all sorts of language constructs.

Forget them, they are doomed to fail!

Instead, try to understand the language in detail, read the specs if there are any, and use regular expressions that are as simple as possible in match statements, using the power of push/set/pop to apply specialized sets of simple patterns to parse your code.

The closer a syntax definition follows official syntax specs the better chances are to maintain correct highlighting under all circumstances.

And it’s way more maintainable, even if number of contexts can be large.


#7

In order to avoid re-inventing the wheel, you probably want to check out ST’s bundled Lua syntax first, instead of trying to build something new on top of the rather naive CoronaSDKLua.sublime-syntax.

If something special is required to support CoronaSDK, I’d recommend creating an inherited syntax based on ST’s Lua.

%YAML 1.2
---
name: Lua (Solar2D)
comment: "Lua Syntax for Solar2D: version 0.2 (after Lua Syntax: version 0.8)"
scope: source.lua.corona
version: 2

extends: Packages/Lua/Lua.sublime-syntax

file_extensions:
  - lua

contexts:

  # example of adding rules to existing contexts!
  statements:
    - meta_prepend: true
    # ...