Again, thanks.
This has turned into a multi-year project. We start with a list of cases, including docket numbers, from a pay web site. We use wget to grab the case pages. Each page merely has a description of the case, not the case itself. Each month yields about 70-80 files. We then use grep to filter out the pages we don’t want, which is roughly 40-50% of them.
Then, in Sublime Text, I use regex to reduce each line down to just the docket number. This takes about 5-6 separate find/replace actions. That leaves the docket in AA123456ZZ format. Normal usage for these dockets has the hyphens, hence the insertion requested above.
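For anyone doing something similar, the find/replace steps could probably be collapsed into one scripted pass. Here is a rough Python sketch, not what we actually run. It assumes the docket is always two capital letters, six digits, and two capital letters, and that the hyphens go between those groups (AA-123456-ZZ); adjust the pattern if your dockets differ. The function names are made up for illustration.

```python
import re

# Assumed docket shape: 2 capital letters, 6 digits, 2 capital letters (AA123456ZZ).
# The hyphen placement AA-123456-ZZ is also an assumption.
DOCKET_RE = re.compile(r"\b([A-Z]{2})(\d{6})([A-Z]{2})\b")


def hyphenate(docket):
    """Turn AA123456ZZ into AA-123456-ZZ; leave anything else unchanged."""
    return DOCKET_RE.sub(r"\1-\2-\3", docket)


def extract_dockets(text):
    """Return hyphenated dockets found in text, in order, without duplicates."""
    found = []
    for match in DOCKET_RE.finditer(text):
        docket = "-".join(match.groups())
        if docket not in found:
            found.append(docket)
    return found


if __name__ == "__main__":
    sample = "Case AA123456ZZ filed; see also BC000123XY and AA123456ZZ."
    for docket in extract_dockets(sample):
        print(docket)
```

Run over the text of each saved page, this gives one hyphenated docket per line, ready to paste into a spreadsheet.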
Once we have the docket numbers, we insert them into a spreadsheet for tracking purposes, and then we request the actual cases from the government agency under the Freedom of Information Act (FOIA). About a month later, they send us PDFs of each case.
We have about 25 years of these to process. In the end, we’ll put the cases online for free access. People shouldn’t have to pay $50/month for basic information that is already in the public domain.