Sublime Forum

How does Sublime detect file encodings?

#1

Hi,

I have a pretty huge web project here that has been growing for about 11 years. I’m new to the project and have to get it ready for some heavy upgrading (PHP, etc.). Because of the project’s structure, all code files need to be converted to UTF-8, but they are in different encodings: UTF-8, cp1252, ISO-8859-1, and ISO-8859-15 (euro sign). Somehow Sublime detects the files correctly, and it is the only program I tried that does. All the others mostly failed to tell ISO-8859-15 and cp1252 apart.

So I would like to know how Sublime does this so well. Maybe that could help me.

Thanks

Silberling

#2

Sublime Text doesn’t do any meaningful encoding detection: if a file is valid UTF-8, it’s opened as UTF-8; otherwise Sublime checks for a UTF-16 BOM, and failing that it just uses the fallback encoding.
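For illustration only, that order of checks could be sketched in Python roughly as follows. This is not Sublime Text's actual code; the function name `guess_encoding` and its `fallback` parameter are made up for the example.

```python
import codecs


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Return an encoding name using the three-step order described above."""
    # Step 1: anything that decodes cleanly as UTF-8 is treated as UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # Step 2: look for a UTF-16 byte order mark (little- or big-endian).
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"

    # Step 3: no real detection, just hand back the configured fallback.
    return fallback
```

Note that step 3 never inspects the bytes, so a cp1252 file and an ISO-8859-15 file both get the same answer.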

#3

Hmm, okay. But how is it able to detect cp1252 correctly while the others report ISO-8859-1(5)?
file -i gets it wrong, chardet doesn’t get it, and Notepad++ doesn’t get it either.

#4

cp1252 is the fallback encoding: if a file isn’t valid UTF-8, it’s just assumed to be cp1252.
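A small Python sketch of why a fixed fallback can look like good detection on files like these. The euro sign is byte 0x80 in cp1252 and byte 0xA4 in ISO-8859-15, and every byte is "valid" in a single-byte encoding, so nothing in the data says which reading is right. The helper names below are hypothetical, and the 0x80-0x9F check is only a rule of thumb (those bytes are rarely used control codes in the ISO-8859 family), not something Sublime does.

```python
# The same character is stored as a different byte in each encoding.
EURO_CP1252 = "€".encode("cp1252")        # one byte, 0x80
EURO_LATIN9 = "€".encode("iso-8859-15")   # one byte, 0xA4


def decode_all(data: bytes) -> dict:
    """Decode the same bytes with each candidate single-byte encoding."""
    results = {}
    for name in ("cp1252", "iso-8859-1", "iso-8859-15"):
        # errors="replace" because cp1252 leaves a few bytes (e.g. 0x81) undefined.
        results[name] = data.decode(name, errors="replace")
    return results


def has_c1_range_bytes(data: bytes) -> bool:
    """Rule of thumb: bytes in 0x80-0x9F hint at cp1252 rather than ISO-8859-x."""
    return any(0x80 <= b <= 0x9F for b in data)
```

For example, `decode_all(b"\xa4")` gives a euro sign only under ISO-8859-15, while `decode_all(b"\x80")` gives one only under cp1252.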

#5

That’s exactly what I didn’t want to hear ;) I had hoped to get some idea of how to handle this mess, but I think I have another solution. Thanks for the info, though :)

#6

As an aside, you may want to look into https://github.com/batterseapower/libcharsetdetect

#7

Thanks for the link. As far as I can see from the code example, it determines a single charset for a chunk of text (such as one whole file)? I’m going to note it in my wiki for later :)

I ran into another problem: some files contain multiple encodings. The first part of a PHP file is cp1252, later on the HTML code inside strings is UTF-8, and then it switches back to something else, all without using PHP’s conversion functions. This is a real mess.

But I found this one: https://github.com/neitanod/forceutf8 and wrote a little PHP script, called from my bash script, that does the conversion for me. It decides what to do based on what it finds and converts it to UTF-8, handling every character individually. Even the unused multi-language files that I accidentally did not exclude yesterday came out right this time.
This has worked pretty well so far. I just have to use Sublime’s well-made search across all files to find the PHP conversion functions and replace some code, and then I should be done.
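
For anyone who wants to experiment with the per-character idea outside PHP, here is a rough Python sketch. It is not forceutf8's code and makes no claim to match its behaviour; the function name `mixed_to_text` and the cp1252 fallback are assumptions for the example. It keeps every byte sequence that is already valid UTF-8 and decodes any other non-ASCII byte on its own with the fallback encoding.

```python
def mixed_to_text(data: bytes, fallback: str = "cp1252") -> str:
    """Decode bytes that mix valid UTF-8 sequences with single-byte text."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII is identical in all the encodings involved.
            out.append(chr(b))
            i += 1
            continue

        # Expected UTF-8 sequence length, judged from the lead byte.
        if 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        else:
            length = 0  # cannot start a UTF-8 sequence

        if length:
            try:
                # The strict decoder rejects truncated or malformed sequences.
                out.append(data[i:i + length].decode("utf-8"))
                i += length
                continue
            except UnicodeDecodeError:
                pass

        # Not UTF-8 at this position: treat this one byte as fallback text.
        out.append(bytes([b]).decode(fallback, errors="replace"))
        i += 1
    return "".join(out)


def convert_to_utf8(data: bytes) -> bytes:
    """Return the same content re-encoded as clean UTF-8."""
    return mixed_to_text(data).encode("utf-8")
```

One caveat with any approach like this: single-byte text that happens to form a valid UTF-8 sequence (for example the two cp1252 characters "Ã¶") is read as UTF-8, so the result is a best guess and worth spot-checking.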
