Sublime Forum

Unicode replacement challenge

#1

In a long bilingual vocabulary file, many letters are represented by one of six “surrogate” letter-pairs marked with an ‘x’: cx, gx and so on. I am seeking instructions for how to replace the six surrogate letter-pairs throughout the file with their counterparts in Unicode.

At the moment I do this manually, using the Windows Find and Replace function, replacing each of the six letter-pairs throughout the file one at a time. I would be grateful for instructions on how to do this more speedily in Sublime Text if that is possible.

The six Unicodes take the form: &#nnn; where nnn represents a three-digit decimal number (see below).

PLEASE NOTE that in the following list of the six surrogates and their Unicode counterparts, I have inserted a space between the &# and the decimal to preserve the Unicodes on-screen. Without it, the forum by default converts the codes to the letters they represent (“ampersand-hash-265-semicolon” becomes “ĉ”), but I have to preserve the Unicodes in their &#- form.

The x-surrogates and their Unicode counterparts:

cx &# 265;

gx &# 285;

hx &# 293;

jx &# 309;

#2

My apologies. The full list of the six x-surrogates and their Unicode counterparts is:
cx &# 265;
gx &# 285;
hx &# 293;
jx &# 309;
sx &# 349;
ux &# 365;
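
For anyone wanting to double-check the table, here is a minimal Python sketch (not from the original posts) that shows which letter each decimal code stands for. The names `SURROGATES` and `describe` are made up for illustration; the numbers are the ones listed above.

```python
# Hypothetical mapping of x-surrogates to decimal code points,
# copied from the list in this post.
SURROGATES = {"cx": 265, "gx": 285, "hx": 293, "jx": 309, "sx": 349, "ux": 365}


def describe(mapping):
    """Return one line per surrogate: pair, &#nnn; form, and the letter itself."""
    return [f"{pair} -> &#{code}; ({chr(code)})" for pair, code in mapping.items()]


if __name__ == "__main__":
    print("\n".join(describe(SURROGATES)))
```

Running it prints one line per pair, so a mistyped code shows up as the wrong letter.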

#3

One way to tackle this: ST’s Find and Replace doesn’t support the richer regex-replacement syntax (conditional substitutions) that would let us do it all in one step, but snippets do support that syntax, so we can use a snippet instead:

  • Tools -> Developer -> New Snippet…
  • replace the template with the following:
<snippet>
    <content><![CDATA[${SELECTION/(cx)|(gx)|(hx)|(jx)|(sx)|(ux)/&#(?{1}265:)(?{2}285:)(?{3}293:)(?{4}309:)(?{5}349:)(?{6}365:);/g}]]></content>
    <!-- Optional: Set a tabTrigger to define how to trigger the snippet -->
    <!-- <tabTrigger>hello</tabTrigger> -->
    <!-- Optional: Set a scope to limit where the snippet will trigger -->
    <!-- <scope>source.python</scope> -->
</snippet>
  • save it as something like replace-unicode.sublime-snippet in the folder ST suggests (Packages/User)
  • Switch back to your document containing the x-surrogates
  • Open the Find panel (with regex mode on) and enter the find pattern: [cghjsu]x
  • Find All
  • Open the Command Palette
  • Type snippet replace-unicode and choose the snippet you created
  • Press Esc to get one caret/selection back
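
If you would rather run the replacement outside ST, the steps above can be sketched in a few lines of Python. This is only a rough equivalent under stated assumptions, not what the snippet does internally: the names `CODES`, `PATTERN` and `replace_surrogates` are hypothetical, and it handles lowercase pairs only, just like the find pattern [cghjsu]x.

```python
import re

# Decimal code points for the six letters, from the list in post #2.
CODES = {"c": 265, "g": 285, "h": 293, "j": 309, "s": 349, "u": 365}

# Same pattern as the Find panel step: one of the six letters followed by 'x'.
PATTERN = re.compile(r"([cghjsu])x")


def replace_surrogates(text):
    """Replace each lowercase x-surrogate with its &#nnn; reference."""
    return PATTERN.sub(lambda m: f"&#{CODES[m.group(1)]};", text)


if __name__ == "__main__":
    print(replace_surrogates("cxiam, ajxo"))
```

As with the find pattern, this replaces every matching letter-pair, so a word in the other language of the file that happens to contain one (“luxury”, say) would be changed too; check for those before running it on the whole file.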

#4

Thank you for the suggestion. I was unable to get the regex to execute anything, probably because it couldn’t find the vocabulary file. On review, though, I don’t think this routine would save much time over the one-at-a-time Find and Replace method, which takes about five minutes, so I’ll stick with that. However, I do very much appreciate the help I have received from this forum. Mike L

#5

Sorry. I take it all back. I neglected to open the vocabulary file in ST so there was nothing for the regex to operate on.

Actually this new method seems to work just fine, and now that the snippet has been saved, the replacement operation is as fast as I was hoping for. Thanks again. Mike L
