Sublime Forum

BUG With ISO-8859-1

#1
ST3 Build 3175 | Windows 10

I’m working on a project where source code files are in ISO-8859-1.

The file extensions are correctly associated to the encoding via a custom syntax definition.

The problem only occurs when I close a source file and reopen it, or when I open a new file containing special characters (but not always). If the file contains characters outside the ASCII range (but valid ISO-8859-1 Latin characters), the file is often (but not always) shown as being UTF-8 (i.e., the status bar shows “UTF-8”, and some characters are wrongly displayed as ?).

Even if I change the encoding back to ISO-8859-1 (via the encoding menu in the status bar of ST3), I often need to close and restart ST for the characters to show up properly, even though other files with the same extension are displayed with the correct encoding.

Proper character representation is usually preserved from session to session; the problem only occurs when opening a file (e.g., closing it and reopening it). It usually only occurs when the file contains accented letters (Italian accented vowels, or some other non-ASCII characters).

Initially I thought that it might have been due to some out-of-range character being accidentally pasted into the text, and the file being forced to UTF-8 in order to accommodate it, but then I discovered it was a display problem that required restarting the app.

As for the last point mentioned (i.e., pasting a character which is outside the encoding’s range), I think that ST should warn before switching the encoding to UTF-8, because once it has switched, undoing the paste operation doesn’t seem to restore the original file encoding. This makes it an encoding-breaking operation, and ST should protect files which have been set to be strictly in a given encoding! But maybe the undo problem is tied to the fact that ST currently can’t refresh the file encoding without restarting.

Example File

Here is a link to a file that shows the problematic behavior mentioned above:

https://github.com/tajmone/Alan3-Italian/blob/master/alanlib_ita/lib_definizioni.i

That file is valid ISO-8859-1 and it compiles correctly with the target compiler. I’ve worked on it in ST3 since the onset, experiencing the problems stated above when I close and reopen it.


#2

I also noticed another problem with ISO 8859-1 files, related to Search-&-Replace operations on files/folders: if the file is open in the editor, then its encoding is preserved; otherwise the file is treated as UTF-8 and its encoding is messed up.

It seems that when the file is open in the editor, ST acknowledges its encoding (as set in the status bar encoding dropdown), while when carrying out replace operations on files not open in the editor it assumes they are UTF-8.

I’ve defined in the syntax definition and settings that the file extensions for this syntax should be treated as ISO 8859-1, the extensions being “*.alan” and “*.i”. If I open a file with either of these two extensions, ST automatically recognizes it as an Alan file and sets its encoding to ISO 8859-1; so the extensions are properly associated with the desired syntax and encoding.

But if the file contains an “è” character, then ST opens it as UTF-8 (and the “è” is shown corrupted)! But “è” is a valid ISO 8859-1 character. I then have to manually set the encoding to ISO 8859-1 AND restart ST in order to show the character correctly. So ST definitely has some problems handling ISO 8859-1 files.
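For anyone who wants to reproduce the mismatch outside the editor, here is a minimal Python sketch. It only illustrates why the ISO 8859-1 byte for “è” cannot be read as UTF-8; it makes no claim about how ST detects encodings internally, and the sample word is just an illustration.

```python
# "è" is U+00E8; ISO-8859-1 stores it as the single byte 0xE8.
latin1_bytes = "perch\u00e8".encode("iso-8859-1")

# 0xE8 is a UTF-8 lead byte that must be followed by continuation bytes,
# so a strict UTF-8 decode of this data fails.
try:
    latin1_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# A lenient decode substitutes U+FFFD, the usual "corrupted character" glyph.
lenient = latin1_bytes.decode("utf-8", errors="replace")

# Decoding with the right encoding round-trips cleanly.
correct = latin1_bytes.decode("iso-8859-1")

print(latin1_bytes, valid_utf8, repr(lenient), repr(correct))
```

So a file like this is never valid UTF-8, which is why an editor that shows it as “UTF-8” can only display the accented letter as a replacement character.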

As for the file replace operations, I’d expect ST to treat files with “*.alan” and “*.i” extensions according to the encoding set in the Alan syntax definition, but apparently ST treats them as UTF-8 regardless of these settings, unless the files are open in the editor.

This has caused me lots of trouble because after global project search and replace operations I’ve found many source files corrupted and had to fix manually all the accented letters which had become a “?” character (or some other strange character). Unfortunately, undoing doesn’t fix things in this case.
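As a stopgap for project-wide replacements, the operation can be done outside the editor with the encoding stated explicitly. The following is a rough Python sketch, not an ST feature: the function name `replace_in_latin1_files` and its parameters are made up for illustration, and it does plain substring replacement only (no regex).

```python
from pathlib import Path


def replace_in_latin1_files(root, patterns, old, new, encoding="iso-8859-1"):
    """Replace `old` with `new` in matching files, reading and writing `encoding`.

    Returns the list of files that were changed.
    """
    # Fail before touching any file if the replacement is not representable
    # in the target encoding (raises UnicodeEncodeError).
    new.encode(encoding)

    changed = []
    for pattern in patterns:
        for path in sorted(Path(root).rglob(pattern)):
            # Go bytes -> text -> bytes explicitly, so neither the platform
            # default encoding nor newline translation gets involved.
            text = path.read_bytes().decode(encoding)
            if old in text:
                path.write_bytes(text.replace(old, new).encode(encoding))
                changed.append(path)
    return changed
```

A call such as `replace_in_latin1_files("alanlib_ita", ["*.alan", "*.i"], "old text", "new text")` would then leave the accented letters untouched. It is worth running it on a copy or a clean Git working tree first.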


#3

The situation is somewhat improved by using “Reopen with encoding” after “Set encoding”, but ST is still not behaving as expected: files whose extensions are associated with ISO-8859-1 encoding are opened with the correct encoding only if they are empty or contain no accented Latin-1 characters.

While working with ISO-8859-1 files, files often get corrupted, especially if search and replace operations are involved.

Also, cut and paste operations from one source file to another (both ISO-8859-1) corrupt the source, because text is pasted as UTF-8.

Although ST3 offers a wide range of encoding options, some of them are problematic to use right now, with ST not handling encoding well and corrupting source files.

This is a rather serious issue, as accented letters then become a ? in the source, and they have to be manually restored, one by one.

Could you please look into the issue, and maybe provide some tips on how to prevent file corruption (e.g., by explaining which operations are more likely to mess up the encoding of source files)?

Thanks.


#4

Solution: Fallback Encoding

I’ve managed to solve most of the problems mentioned above (regarding correctly opening ISO-8859-1 files) by also setting ISO-8859-1 as the "fallback_encoding" in the syntax settings:

{
    "default_encoding":  "Western (ISO 8859-1)",
    "fallback_encoding": "Western (ISO 8859-1)",
}

… whereas before it was falling back to UTF-8.
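The general idea behind a fallback encoding can be sketched in a few lines of Python. This is only the common “try UTF-8 first, otherwise use the fallback” pattern, written under the assumption that ST does something along these lines; the function name `decode_with_fallback` is hypothetical and this is not ST’s actual code.

```python
def decode_with_fallback(data: bytes, fallback: str = "iso-8859-1"):
    """Return (text, encoding_used): try strict UTF-8 first, then the fallback."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a character, so this
        # second decode cannot fail for that particular fallback.
        return data.decode(fallback), fallback


print(decode_with_fallback(b"perch\xe8"))
```

Under that pattern, a file containing the byte 0xE8 fails the UTF-8 check and is decoded with whatever the fallback is, which would explain why the fallback setting matters for files with accented letters but not for pure ASCII ones.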

Maybe the documentation should cover these settings in more detail for non-Unicode file encodings.

Unicode Pasting: Problems and Feature Request

The problems relating to cut/copy and paste operations still persist though.

For example, if I cut from a Unicode file the text “AAA — BBB” (with the em-dash being out of the ISO-8859-1 range) and paste it into a source file whose extension is set to be "Western (ISO 8859-1)", ST performs the paste operation and I get an error when I try to save:

  • After pasting “AAA — BBB”, the encoding shown in the status bar is still “Western (ISO 8859-1)”
  • At save time I get the error popup mentioning that some chars are not representable, and that ST is falling back to UTF-8 (file encoding is converted and saved without asking for confirmation!)
  • Undoing the paste operation doesn’t bring the file’s encoding back to ISO-8859-1

In contexts where the ISO encoding is strictly required, the above situation is problematic; and the feature request below could avoid all this.
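Until something like that exists, text can be checked before it is pasted or saved. Below is a small Python sketch of such a check; the helper name `unencodable_chars` is made up for illustration, and wiring it into ST (for example from a plugin that runs before saving) is left out because I have not tested that part.

```python
def unencodable_chars(text, encoding="iso-8859-1"):
    """Return (index, character) pairs that `encoding` cannot represent."""
    problems = []
    for index, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            problems.append((index, char))
    return problems


# "\u2014" is the em-dash from the "AAA ... BBB" example above.
pasted = "AAA \u2014 BBB"
for index, char in unencodable_chars(pasted):
    print(f"offset {index}: U+{ord(char):04X} is not valid ISO-8859-1")
```

An empty result means the text can be saved as ISO-8859-1 without any fallback conversion.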

Feature Request: Strict-Encoding Settings

Although this is reasonable behaviour for ST, it would be useful to have a setting to strictly enforce a given encoding for a syntax or its file extensions, which would then prevent pasting characters which are out of range for the given encoding. Something like enforce_encoding:

{
    "default_encoding":  "Western (ISO 8859-1)",
    "fallback_encoding": "Western (ISO 8859-1)",
    "enforce_encoding": true
}

If the user attempts to paste an out-of-range char into a file with strict encoding settings, ST should show an error pop-up and block the operation, in order to preserve the file encoding integrity.

Similarly, this setting should also be taken into account by ST in Search-&-Replace operations, especially when applying them to folders and files. Again, a substitution that would introduce out-of-range characters should be blocked.

I think this would be an important setting, because it’s quite easy to corrupt a file’s encoding via paste or S&R operations, and once the file is converted to UTF8 (or if the encoding gets corrupted) restoring the original file’s special characters is not always easy, and in case of encoding corruption often impossible (except manually).
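For the milder case, where a file was silently re-saved as UTF-8 but no characters were lost, the conversion can usually be reversed. Here is a hedged Python sketch; `recover_latin1` is a hypothetical helper, and it relies on the heuristic that non-ASCII data which decodes cleanly as UTF-8 really is UTF-8, which can misfire on unusual Latin-1 text.

```python
def recover_latin1(data: bytes):
    """If `data` looks like Latin-1 text that was re-saved as UTF-8,
    return the equivalent ISO-8859-1 bytes; otherwise return None."""
    if data.isascii():
        return None  # pure ASCII is identical in both encodings
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return None  # not valid UTF-8, so presumably still ISO-8859-1
    try:
        return text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return None  # has characters outside Latin-1 (U+FFFD included)
```

Accented letters that were already replaced by a literal “?” or by U+FFFD cannot be recovered this way, since the original byte is gone; those still need manual fixing or a restore from version control.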
