Sublime Forum

Multiple regular expression search and replace on large files

#1

Sorry, I know that similar things have been asked before, but I was wondering whether anyone has suggestions for how to reasonably manipulate a large file, to cut it down to size and maybe turn it into a more usable format.

In particular, while working on the USB Host code for Teensy 3.6 and 4.0 beta (PJRC), I often use the Saleae Logic Analyzer to capture USB traffic for maybe 20-30 seconds. For example, I might be trying to figure out the packets that are sent back and forth to a Bluetooth controller to pair with an XBox One controller… I then have the Logic Analyzer output the report as a CSV formatted file, which might be 50000 lines long… But most of the time, I can cut this file down to maybe 100-1500 lines of data that I want to look through.

And I find myself doing a bunch of standard search and replace steps to start reducing this data…

Things like:

a) Remove all SOF (Start of Frame) lines: search for .*,SOF,.*\n and replace with nothing…
b) Remove all NAK sequences: search for .*,IN,.*\n.*,NAK,.*\n and again replace with nothing
c) Remove all ACK lines: search for .*,ACK,.*\n and replace with nothing
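
The three removal rules above could also be applied in a single streaming pass instead of three editor-wide regex replacements. The following is only a minimal sketch: the function name `drop_noise` is made up for illustration, and it assumes that every capture row is one text line with the packet type as an uppercase comma-separated field, which may not match the real Saleae export exactly.

```python
def drop_noise(lines):
    """Yield lines with SOF rows, IN+NAK pairs and ACK rows removed.

    Assumption: the packet type shows up as ",SOF,", ",IN,", ",NAK"
    or ",ACK," somewhere in the row (hypothetical layout).
    """
    pending_in = None  # an IN row held back until we see the next row
    for line in lines:
        if ",SOF," in line:
            continue  # rule a: drop Start of Frame rows first
        if pending_in is not None:
            held, pending_in = pending_in, None
            if ",NAK" in line:
                continue  # rule b: drop the IN row together with its NAK
            yield held  # the IN was not NAKed, so keep it
        if ",ACK," in line:
            continue  # rule c: drop ACK rows
        if ",IN," in line:
            pending_in = line  # decide once the following row is known
            continue
        yield line
    if pending_in is not None:
        yield pending_in  # file ended right after an IN row
```

Because it only ever holds one row in memory, a sketch like this should handle a capture with millions of lines without loading it into an editor buffer.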

Then I often convert the IN/OUT/SETUP sequences from two lines to one line, with something like:
find: .*,\b(IN|OUT|SETUP)\b,0x[0-9A-F]*,(.*)\n.*DATA[01],(.*\n)
replaced with: $1,$2,$3

Note: this last one may not have been exact, maybe depends on how many ,s I need to remove…
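
As a rough illustration of the two-lines-to-one step, here is a small Python sketch. It does not try to reproduce the exact columns of the regex above (which, as noted, may not be exact); it simply joins a token row with the DATA row that directly follows it. The name `merge_token_data` and the assumed row layout are hypothetical.

```python
import re

# Assumption: a token row contains ",IN,", ",OUT," or ",SETUP," and its
# payload follows on the very next row as ",DATA0," or ",DATA1,".
TOKEN = re.compile(r",(IN|OUT|SETUP),")
DATA = re.compile(r",DATA[01],")


def merge_token_data(lines):
    """Join each token row with the DATA row directly after it.

    Yields rows without their trailing newline.
    """
    previous = None  # last row read but not yet emitted
    for line in lines:
        line = line.rstrip("\n")
        if previous is not None and TOKEN.search(previous) and DATA.search(line):
            yield f"{previous},{line}"  # token row and payload on one line
            previous = None
            continue
        if previous is not None:
            yield previous
        previous = line
    if previous is not None:
        yield previous
```

Trimming the merged row down to just the wanted columns would depend on how many fields the export really has, so that part is left out here.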

Up till now, this is pretty close to how I prepared the data for all of my postings, where I then imported the data into Excel and showed it in forum posts about Bluetooth and joysticks…

If there were a simple way to automate this, it would be great. I am currently using Sublime Text search and replace… I thought I would try to make a macro that does this, but its macros don’t appear to support search/replace.

I have the beginnings of a setup using the RegReplace plugin to do this… It uses a Python script to do the first part of this. The reg_replace_rules so far:

{
    "format": "3.0",
    "replacements": {
        "remove_usb_packets_ACK": {
            "case": false,
            "find": "^.*,ACK,.*\\n",
            "greedy": true,
            "replace": ""
        },
        "remove_usb_packets_IN_NAK": {
            "case": false,
            "find": "^.*,IN,.*\\n.*,NAK.*\\n",
            "greedy": true,
            "replace": ""
        },
        "remove_usb_packets_SOF": {
            "case": false,
            "find": "^.*,SOF,.*\\n",
            "greedy": true,
            "replace": ""
        }
    }
}
And then the command:

[
    {
        "caption": "USB Logic Packet format",
        "command": "reg_replace",
        "args": {
            "replacements": [
                "remove_usb_packets_SOF",
                "remove_usb_packets_IN_NAK",
                "remove_usb_packets_ACK"
            ]
        }
    },
]
And I think it sort of worked, BUT the performance was really bad. That is, I waited for several minutes, then went outside to play with the dogs for a little while, and it had completed when I got back in…
And again this is only a subset…

If I had something that worked well like macros, I would like to expand it to maybe do things like:
Search for lines that start with IN where the data area starts with 0x02 and insert EV_INQUIRY_RESULT, or if it is 0x17 insert EV_LINK_KEY_REQUEST.
That is, assume the above had data like: IN,0x01,0x17 0x…
Rule like: Search for: (IN,0x01,)(,0x17) replace with: $1EV_LINK_KEY_REQUEST$2

Likewise for some of the OUT lines, where again if I see something like (OUT,0x01),(0x01 0x04), insert the word HCI_INQUIRY between the commas…
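
In a script, this kind of annotation could be driven by lookup tables rather than one regex rule per name. The sketch below is only an illustration: it assumes already-merged rows shaped like `IN,0x01,0x17 0x…` (direction, one hex field, then space-separated data bytes), which is a guess at the layout, and `annotate`, `EVENT_NAMES` and `COMMAND_NAMES` are made-up names. Only the three labels mentioned above are included.

```python
import re

# Names taken from the examples above; extend the tables as needed.
EVENT_NAMES = {0x02: "EV_INQUIRY_RESULT", 0x17: "EV_LINK_KEY_REQUEST"}
COMMAND_NAMES = {(0x01, 0x04): "HCI_INQUIRY"}

# Assumed row layout (hypothetical): "<IN|OUT>,0x<hex>,<data bytes...>"
IN_ROW = re.compile(r"(IN,0x[0-9A-Fa-f]+,)0x([0-9A-Fa-f]{2})\b")
OUT_ROW = re.compile(r"(OUT,0x[0-9A-Fa-f]+,)0x([0-9A-Fa-f]{2}) 0x([0-9A-Fa-f]{2})\b")


def annotate(line):
    """Insert a symbolic name in front of the data bytes, if one is known."""
    m = IN_ROW.match(line)
    if m:
        name = EVENT_NAMES.get(int(m.group(2), 16))
    else:
        m = OUT_ROW.match(line)
        name = None
        if m:
            key = (int(m.group(2), 16), int(m.group(3), 16))
            name = COMMAND_NAMES.get(key)
    if not name:
        return line  # unknown row or unknown code: leave untouched
    split = m.end(1)  # position right after "IN,0x01," / "OUT,0x01,"
    return f"{line[:split]}{name},{line[split:]}"
```

Adding another event or command then means adding one dictionary entry instead of writing another search/replace rule.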

Suggestions on how to setup such a script or which editor works to do something like this would be appreciated!

Kurt


#2

FYI - In this case I found a shortcut that gets me part of the way there a lot quicker. That is, instead of trying to remove a lot of the stuff I don’t need, I use the Linux grep utility to grab only the data that I mainly want:
that is, any line that has ,DATA in it, plus the line before it:

grep -B1 ",DATA" USB_Capture_WIndows_XBoxOne_Pair_packets2.txt > u2

In the last run I did, this reduced the file from 3,007,936 lines down to 2066 lines, including a -- separator line between each match, which I then removed…
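
For anyone who wants the same filter inside a script, here is a minimal Python sketch of what `grep -B1 ",DATA"` does, minus the separator lines, so there is nothing to clean up afterwards. The function name `data_with_context` is hypothetical, and the only assumption about the file is that the wanted rows contain the text `,DATA`.

```python
def data_with_context(lines, needle=",DATA"):
    """Yield each line containing `needle`, preceded by the line before it."""
    previous = None  # most recent line that has not been emitted yet
    for line in lines:
        if needle in line:
            if previous is not None:
                yield previous  # one line of leading context, like -B1
            yield line
            previous = None  # already emitted; do not repeat it as context
        else:
            previous = line


# Possible usage (file names are placeholders):
# with open("capture.txt") as src, open("u2", "w") as dst:
#     dst.writelines(data_with_context(src))
```

The output of a function like this could be fed straight into further cleanup steps in the same script.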

So maybe I will try to find an easy way to automate this first part, and then maybe I can use RegReplace to do some of the cleanup… It would probably be better to do this in a scripting language, but we will see.
