Sublime Forum

Syntax definition for DNA files - ignore \n?

#1

Hi

I’d like to write a syntax definition for DNA sequence files (FASTA format). The syntax of such files is very simple. It consists of a header line that starts with a “>” sign. The following lines consist of the sequence itself. My problem is that sequence elements can span multiple lines (including the newline characters). Is it somehow possible to highlight such elements?

Here is a simple example, where the element atg (in lower case) should be tagged.

>SEQUENCENAME here may follow some comment
GCTGCGAat
gGATGATCA

Thanks for your help!


#2

does this work ok?

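[code]<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
    <key>fileTypes</key>
    <array>
        <string>fasta</string>
    </array>
    <key>name</key>
    <string>fasta</string>
    <key>patterns</key>
    <array>
        <dict>
            <key>begin</key>
            <string>^(?!>)</string>
            <key>end</key>
            <string>$</string>
            <key>patterns</key>
            <array>
                <dict>
                    <key>name</key>
                    <string>variable.fasta</string>
                    <key>match</key>
                    <string>[ctga]</string>
                </dict>
            </array>
        </dict>
    </array>
    <key>scopeName</key>
    <string>source.fasta</string>
    <key>uuid</key>
    <string>62d34382-8c31-4c0d-b03f-1af892b18f24</string>
</dict>
</plist>[/code]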

#3

Thanks, but unfortunately it’s not what I am looking for. I would like to highlight ATG sequences across line ends. The regex pattern should also match when the A is the last character of a line and TG are the first two characters of the next line. The following pattern works in the search field:

(?i)(A[\s\d]*T[\s\d]*G)

…but it does not in the syntax definition. The modifier (?s) did not help either. It seems that the syntax is parsed by the engine line by line.
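To illustrate the difference, here is a small Python sketch (not Sublime code) that applies the same pattern once to the whole text and once to each line separately. It assumes the syntax engine only ever hands a single line to the regex, which is what the behaviour suggests; the function names are made up for this example.

```python
import re

# Pattern from above: A, T, G with optional whitespace/digits in between.
START_CODON = re.compile(r"(?i)a[\s\d]*t[\s\d]*g")

fasta = ">SEQUENCENAME here may follow some comment\nGCTGCGAat\ngGATGATCA\n"


def matches_whole_buffer(text):
    """Search the complete text at once, as the search field does."""
    return [m.group(0) for m in START_CODON.finditer(text)]


def matches_line_by_line(text):
    """Search each line in isolation, as a line-based parser would."""
    found = []
    for line in text.splitlines(True):  # keep the trailing newline
        found.extend(m.group(0) for m in START_CODON.finditer(line))
    return found


print(matches_whole_buffer(fasta))   # ['at\ng', 'ATG']
print(matches_line_by_line(fasta))   # ['ATG']
```

The match that spans the line break only shows up when the regex sees both lines together, which would explain why the pattern works in the search field but not in the syntax definition.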

Here is my first draft of the syntax definition, which did not work (JSON):

{
    "name": "Fasta DNA",
    "scopeName": "text.fasta",
    "foldingStartMarker": ">",
    "foldingStopMarker": "(?=^>)",
    "fileTypes": ["fa", "fasta"],
    "patterns": [
        {
          "contentName": "meta.sequence.fasta",
          "begin": "(^>(\\S+).*$)",
          "beginCaptures": {
            "1": {"name": "comment.line.fasta"},
            "2": {"name": "entity.name.function.fasta"}
          },
          "end": "(?=^>)",
          "patterns": [
            {
                "match": "(?i)(A[\\s\\d]*T[\\s\\d]*G)",
                "name": "keyword.fasta",
                "comment": "start codon"
            }
          ]
        }
    ],
    "uuid": "ca03e751-04ef-4330-9a6b-9b99aae1c418"
}

#4

I came up with this one:

<dict>
    <key>name</key>
    <string>variable.fasta</string>
    <key>match</key>
    <string>atg</string>
</dict>
<dict>
    <key>begin</key>
    <string>at$</string>
    <key>end</key>
    <string>^g</string>
    <key>name</key>
    <string>variable.fasta</string>
</dict>
<dict>
    <key>begin</key>
    <string>a$</string>
    <key>end</key>
    <string>^tg</string>
    <key>name</key>
    <string>variable.fasta</string>
</dict>
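Yet a simple input like

GCTGCGAat
GCTGCGAat
gGATGATCA

breaks it: the "at" at the end of the first line opens a region that is not closed until the next line starting with a lowercase "g". If such a sequence is possible, the only solution might be to write a plugin for this.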
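A rough, untested sketch of what such a plugin could look like is below. It scans the whole buffer with the regex from post #3, skips matches that start inside a ">" header line, and marks the rest with `view.add_regions`. The class name, the region key `fasta_start_codon`, and the helper `find_start_codons` are invented for this example; the `text.fasta` and `keyword.fasta` scopes are taken from the JSON draft in post #3. The `try`/`except` around the imports is only there so the helper can be tried outside Sublime Text.

```python
import re

try:
    # These modules exist only inside Sublime Text's plugin host.
    import sublime
    import sublime_plugin
except ImportError:
    sublime = None
    sublime_plugin = None

# A, T, G with optional whitespace or digits in between, case-insensitive.
START_CODON = re.compile(r"(?i)a[\s\d]*t[\s\d]*g")


def find_start_codons(text):
    """Return (start, end) offsets of ATG matches outside header lines."""
    spans = []
    for m in START_CODON.finditer(text):
        line_start = text.rfind("\n", 0, m.start()) + 1
        if text.startswith(">", line_start):
            continue  # match begins inside a ">" header line
        spans.append((m.start(), m.end()))
    return spans


if sublime_plugin is not None:

    class FastaStartCodonListener(sublime_plugin.EventListener):
        """Hypothetical listener; name and region key are assumptions."""

        def on_load(self, view):
            self.mark(view)

        def on_modified(self, view):
            self.mark(view)

        def mark(self, view):
            # Only touch views whose syntax uses the text.fasta scope.
            if not view.score_selector(0, "text.fasta"):
                return
            text = view.substr(sublime.Region(0, view.size()))
            regions = [sublime.Region(a, b)
                       for a, b in find_start_codons(text)]
            view.add_regions("fasta_start_codon", regions, "keyword.fasta")
```

Note that this rescans the entire file on every modification, which is probably too slow for large sequence files; limiting the scan to the edited lines would be a sensible refinement.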
