Sublime Forum

**diomed** · June 3, 2016, 9:13am

Hi

For over a year now, I’ve been writing and improving a script that is part of SubtitleEdit,
but yesterday I figured that it could also be part of Sublime Text, to help me replace misspelled words with the correct ones when editing ebooks and other large pieces of text.

Can anyone help me get started with how this code would look in ST and where I would put it?
As a macro, I assume:

github.com

SubtitleEdit/subtitleedit/blob/master/Dictionaries/hrv_OCRFixReplaceList.xml

<OCRFixReplaceList>
  <WholeWords>
    <Word from="()d" to="Od" />
    <Word from="advokati" to="odvjetnici" />
    <Word from="Advokati" to="Odvjetnici" />
    <Word from="advokatima" to="odvjetnicima" />
    <Word from="Advokatima" to="Odvjetnicima" />
    <Word from="afirmiše" to="afirmira" />
    <Word from="ajpod" to="iPod" />
    <Word from="ajpoda" to="iPoda" />
    <Word from="ajpodu" to="iPodu" />
    <Word from="Ajpod" to="iPod" />
    <Word from="Ajpoda" to="iPoda" />
    <Word from="Ajpodu" to="iPodu" />
    <Word from="akcenat" to="naglasak" />
    <Word from="akcionara" to="dioničara" />
    <Word from="akvarijum" to="akvarij" />
    <Word from="akvarijumu" to="akvariju" />
    <Word from="amin" to="amen" />
    <Word from="Amin" to="Amen" />

This file has been truncated.

Please, help me out. Thank you.

**fico** · June 3, 2016, 2:09am

Something like this should work:

import sublime, sublime_plugin

class ocr_fix( sublime_plugin.TextCommand ):
	def run( self, edit ):

		regexPairs = get_RegEx_Pairs()

		for queryPattern, replacementPattern in regexPairs:
			replacements = []
			resultRegions = self.view.find_all( queryPattern, 0, replacementPattern, replacements )

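			# Replace from the last match to the first, so that earlier regions
			# are not shifted when a replacement changes the length of the text.
			for index in reversed( range( len( resultRegions ) ) ):
				self.view.replace( edit, resultRegions[ index ], replacements[ index ] )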

def get_RegEx_Pairs():

	regexPairs = []

	regexPairs.append( ( "advokati", "odvjetnici" ) )
	regexPairs.append( ( "Advokati", "Odvjetnici" ) )
	regexPairs.append( ( "advokatima", "odvjetnicima" ) )
	regexPairs.append( ( "Advokatima", "Odvjetnicima" ) )
	regexPairs.append( ( "amin", "amen" ) )
	regexPairs.append( ( "Amin", "Amen" ) )

	return( regexPairs )

The command ocr_fix can be assigned to a key-binding or command palette entry.

Here’s the same script with all of the replacement pairs: @ Gist

You might need to check the RegEx patterns, I just did a quick replacement of \\(?![nt"]) with \\\\ to make the strings valid for Python.
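
If hand-copying the pairs gets tedious, the list could also be generated from the XML. Below is a minimal sketch, assuming the entries under WholeWords are literal words rather than RegEx. The helper names `whole_word_pairs` and `apply_pairs` and the sample text are made up for illustration, and it uses only standard-library Python, not the Sublime API:

```python
import re
import xml.etree.ElementTree as ET

# A few entries copied from the list above, closed off so the XML parses.
SAMPLE_XML = """
<OCRFixReplaceList>
  <WholeWords>
    <Word from="()d" to="Od" />
    <Word from="advokati" to="odvjetnici" />
    <Word from="amin" to="amen" />
  </WholeWords>
</OCRFixReplaceList>
"""

def whole_word_pairs(xml_text):
    """Return (pattern, replacement) pairs for every <Word> under <WholeWords>."""
    root = ET.fromstring(xml_text)
    pairs = []
    for word in root.find("WholeWords").findall("Word"):
        # re.escape keeps characters such as "(" literal. The lookarounds act as
        # word boundaries that also work when the word starts with punctuation.
        pattern = r"(?<!\w)" + re.escape(word.get("from")) + r"(?!\w)"
        pairs.append((pattern, word.get("to")))
    return pairs

def apply_pairs(text, pairs):
    """Apply every pair in order and return the corrected text."""
    for pattern, replacement in pairs:
        # A function replacement inserts the text as-is, with no escape processing.
        text = re.sub(pattern, lambda match, r=replacement: r, text)
    return text

fixed = apply_pairs("()d advokati do advokatima, amin", whole_word_pairs(SAMPLE_XML))
print(fixed)  # Od odvjetnici do advokatima, amen
```

Python’s re module and Sublime’s RegEx engine may not agree on every escape, so patterns generated this way would still need checking before they go into find_all.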

**diomed** · June 1, 2016, 10:56am

oh, wow! thank you very much

where do I put that file? in Packages or?

Also, not everything in the original file is a RegEx.
I assume this converted all of it to RegEx?

For example, if I have čk -> čak,
I wouldn’t want ručka to turn into ručak.
So I wonder: will this match only whole words, or will it apply the pattern everywhere?

**fico** · June 1, 2016, 12:51pm

You can save it @ /Packages/OCR Fix/

By default, the view.find_all function uses RegEx. You can also add a flag so that it searches for literal strings instead of RegEx, but I don’t think that would respect word boundaries.

I’d say your best bet is to stick with RegEx & add \\b where necessary to avoid partial matches.
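
To make the ručka case concrete, here is a small sketch that uses Python’s own re module as a stand-in. That is an assumption: find_all uses Sublime’s RegEx engine, not Python’s, although \b is expected to behave the same way for this example.

```python
import re

text = "ručka i čk"

# Without word boundaries, the pattern also fires inside "ručka".
unbounded = re.sub("čk", "čak", text)

# With \b on both sides, only the standalone "čk" is replaced.
bounded = re.sub(r"\bčk\b", "čak", text)

print(unbounded)  # ručaka i čak
print(bounded)    # ručka i čak
```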

**diomed** · June 1, 2016, 2:03pm

Please bear with me, because it’s the first time I’ve done something like this.
I’ve put it in the OCR Fix folder in Packages.

So how do I load it and run it, exactly?
I thought it would show up under Packages in the menu.

When you say “flag it”, what do you mean by that?
Also, a bit of info on respecting word boundaries, if you don’t mind.
I apologize for so many questions.

**fico** · June 1, 2016, 7:09pm

• save this code to:
/Packages/OCR Fix/Default.sublime-commands

[

	{
		"caption": "OCR Fix",
		"command": "ocr_fix",
	},
	
]

• open the command palette with Ctrl + Shift + P
• type OCR Fix and press Enter

OR

• save this code to:
/Packages/OCR Fix/Default.sublime-keymap

[

	{
		"keys": ["ctrl+shift+alt+o"],
		"command": "ocr_fix",
	},

]

• press Ctrl + Shift + Alt + O

From Sublime Text > API Reference > View:

[Region]

find_all(pattern, <flags>, <format>, <extractions>)

Returns all (non-overlapping) regions matching the regex pattern. The optional flags parameter may be sublime.LITERAL, sublime.IGNORECASE, or the two ORed together. If a format string is given, then all matches will be formatted with the formatted string and placed into the extractions list.

Some of the RegEx patterns in the list you posted already use the word boundary metacharacter:

For example:

\b([aA])bsorbira will match absorbira in:

Case 1: "abc absorbira xyz"

but not in

Case 2: "abcabsorbira xyz"   or   Case 3: "abcabsorbiraxyz"

but it would match

Case 4: "abc absorbiraxyz"

In order to prevent case 4, you could use:
\b([aA])bsorbira\b

( In these examples, I used plain RegEx. Make sure you use properly escaped backslashes in the actual code for Python compatibility. EG: \\b )
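
On the escaping point: inside a normal Python string, "\b" is a backspace character, so the word boundary has to be written as "\\b" or as a raw string r"\b". The short sketch below checks this and the four cases above, again with Python’s re module standing in for find_all (an assumption, since Sublime uses its own RegEx engine):

```python
import re

# "\b" in a normal string is the backspace character, not a word boundary.
assert "\b" == "\x08"
# Doubling the backslash or using a raw string gives the two characters \ and b.
assert "\\b" == r"\b"

start_only = "\\b([aA])bsorbira"      # boundary at the start only
both_ends = "\\b([aA])bsorbira\\b"    # boundary at both ends

cases = {
    1: "abc absorbira xyz",
    2: "abcabsorbira xyz",
    3: "abcabsorbiraxyz",
    4: "abc absorbiraxyz",
}

matches_start_only = {n: bool(re.search(start_only, s)) for n, s in cases.items()}
matches_both_ends = {n: bool(re.search(both_ends, s)) for n, s in cases.items()}

print(matches_start_only)  # {1: True, 2: False, 3: False, 4: True}
print(matches_both_ends)   # {1: True, 2: False, 3: False, 4: False}
```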

**diomed** · June 2, 2016, 10:10am

I’ve created Default.sublime-keymap
and copy-pasted the code you posted.

I’ve also taken a screenshot so you can tell whether it’s in the right location.

What exactly should happen when I do this?
I expected it to show up under Preferences - Package Settings.
Am I mistaken? Anyhow, nothing happens, and I don’t know what’s wrong.
Got any ideas? I’ve tried every uppercase/lowercase combination I could imagine,
because I’m not sure whether it’s case sensitive. Finally, I renamed the file to ocr_fix.py, as in your example of the command, but still nothing happens. Is a restart of the program necessary for this?
I did that, but the issue remains.

**fico** · June 2, 2016, 10:27am

Only plugins with a Main.sublime-menu file will show up there.

You should just be able to run the plugin with one of the two methods I described in my previous post.

**diomed** · June 2, 2016, 10:42am

But what exactly is supposed to happen when I press those keys: ctrl+shift+alt+o?
Because I see no effect.

**fico** · June 2, 2016, 11:40am

It will automatically replace every instance of the misspelled words in the entire document.

**diomed** · June 2, 2016, 2:23pm

So you’re saying I press those 4 keys and it should automatically run and change the words?
That does not happen.

Could you please test whether it works on your PC?

Ali sedamdeset sedma godina Morgantea postala je sedamdeset sedma godina Sendovanija, i mada je Lok uspeo da neko vreme prikriva svoja dela od Krađoučitelja, još jednom prilikom je doživeo čudesan neuspeh u pokušaju da bude obazriv. Kada je Krađoučitelj shvatio šta je dečak uradio, otišao u posetu kapi Kamora i obezbedio dozvolu za jednu malu smrt. Tek se uzgred setio da ode bezokom svešteniku, ne da bi bio milosrdan, već zato što je to bila poslednja prilika da ostvari kakvu-takvu dobit.

**fico** · June 2, 2016, 2:47pm

Just fixed the class name, should work now.

I usually use the uppercase naming convention of PluginNameCommand & forgot that OCR Fix would require the lowercase convention of plugin_name.
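
For anyone puzzled by the naming: Sublime Text derives the command name from the class name. The sketch below approximates that rule in plain Python. The function `command_name` is hypothetical and written from memory of how sublime_plugin behaves, so treat it as an illustration rather than the actual implementation:

```python
def command_name(class_name):
    """Approximate the command name Sublime Text derives from a plugin class name."""
    name = class_name[0].lower()
    last_upper = False
    for char in class_name[1:]:
        if char.isupper() and not last_upper:
            # A new capitalised word starts here: OcrFix -> ocr_fix
            name += "_" + char.lower()
        else:
            name += char
        last_upper = char.isupper()
    # A trailing "Command" is dropped from the final name.
    if name.endswith("_command"):
        name = name[:-len("_command")]
    return name

print(command_name("OcrFixCommand"))  # ocr_fix
print(command_name("ocr_fix"))        # ocr_fix
```

Under that rule, either spelling of the class should answer to "command": "ocr_fix" in the keymap and command palette files.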

I did notice that some of the RegEx patterns use capturing group replacements, & they’re showing up as literal symbols in the replacements ( $1, etc. ), so you’ll need to handle that with extractions or throw a manual RegEx replacement in the loop.

**diomed** · June 2, 2016, 3:19pm

Ah yes, I’ve found it, changed it, and now it works.

Could you please give me some examples of extractions and the like, so I can fix the rest by following them?
I’m not much of a coder; I work by example.

for example:
dečak -> dječak [instead of $1ječa$2]

I see that this will require lots of fixing before it works the way I imagined it would.

**fico** · June 3, 2016, 2:24am

I just updated the code, the extractions list fixed it. I believe it will work as you expect.
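
Since the updated code lives in the Gist, here is the idea behind the extractions list sketched in plain Python. Two assumptions: the pattern below is made up to fit the dečak example (the real entry in the list may differ), and `find_all_with_extractions` is a hypothetical stand-in for view.find_all, not the Sublime API.

```python
import re

def expand_dollar_groups(match, fmt):
    """Replace each $N in fmt with the text captured by group N of match."""
    return re.sub(r"\$(\d)", lambda m: match.group(int(m.group(1))), fmt)

def find_all_with_extractions(text, pattern, fmt):
    """Return the match spans plus one formatted replacement string per match."""
    spans, extractions = [], []
    for match in re.finditer(pattern, text):
        spans.append(match.span())
        extractions.append(expand_dollar_groups(match, fmt))
    return spans, extractions

original = "dečak i Dečak"
spans, extractions = find_all_with_extractions(
    original, r"\b([dD])eča(k)\b", "$1ječa$2"
)

# Apply the replacements from the last match to the first so that the
# earlier spans still point at the right characters.
fixed = original
for (start, end), replacement in reversed(list(zip(spans, extractions))):
    fixed = fixed[:start] + replacement + fixed[end:]

print(extractions)  # ['dječak', 'Dječak']
print(fixed)        # dječak i Dječak
```

The point is that the text written into the document comes from the extractions list, where $1 and $2 have already been filled in, rather than from the raw format string.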

**diomed** · June 3, 2016, 8:26am

I can’t thank you enough for all your help and kindness.
This is, in general, what I wanted. Thank you so much.
I really appreciate it.

**acee** · June 3, 2016, 8:30pm

Is that a plugin that you are using to display the differences? If so, which plugin is that?

**fico** · June 3, 2016, 9:20pm

That particular plugin is Compare Side-By-Side. For Git files, I use GitGutter

There are also various Diff Plugins that offer different display styles.

**diomed** · January 29, 2018, 4:41pm

replacementPairs.append( ( "њ", "nj" ) )

Can anyone help me?
I’ve hit an obstacle I don’t understand.
Apparently, the replacement pair above is not applied, and that character remains.

[Solved] How to convert this script to be a part of SublimeText?