As promised, a full description of the Arch Linux compatibility bug:
The symptoms were that in a recent dev build, all the regexes loaded from .tmLanguage files were missing their last character, which would often render them invalid. This wasn't happening on any other operating systems, nor any other Linux distributions.
The first thing I tried was getting a livecd version of Arch Linux (ArchBang in this case), and running it under a VM. Everything was working peachy - I had a bug that was triggering only for users of one specific Linux distribution, except not for me.
One of the changes in the broken build was that plist files (as .tmLanguage ones are) are now represented differently in memory, so I assumed that the .cache files weren't entirely compatible (every .tmLanguage has a corresponding binary representation in a .cache file, to save on XML parsing time at startup), and that something had gone wrong there. Further suggesting this was the problem, the size of the generated .cache files had changed between the two builds.
This turned out to be a red herring. The .cache files contain key-value dictionaries, stored in memory as hash tables, and they're written out in the order the keys live in memory. This order is usually consistent, but had changed in this build because the key representation changed from wchar_t * (i.e., UTF-16 on Windows and UTF-32 on Linux and OS X), to a UTF-8 encoded char *. .cache files are zlib compressed, so changing the order of the keys will change the compressibility of the file, leading to different file sizes.
My next guess, given I wasn't able to reproduce the issue, was that it was memory management related. One of the changes in the same build was to place plist strings in a memory arena (*1). Along with this, the memory arena code was changed to make use of malloc_usable_size() (aka malloc_size on OS X and _msize on Windows) to better utilise available memory, and glibc has at least one scenario where it implements this function incorrectly ([sourceware.org/bugzilla/show_bug.cgi?id=1349](http://sourceware.org/bugzilla/show_bug.cgi?id=1349)). Alas, removing the calls to malloc_usable_size() didn't fix anything.
I was pretty much out of ideas by now, and still unable to replicate the bug myself. As a hunch, I guessed that using a derivative of Arch Linux rather than the real thing could be related to this. Several hours later, after downloading and installing Arch the hard way, I was able to replicate the bug!
Running under GDB I could see that, yes, the regexes were indeed missing their last character. They had it when they were loaded from disk, but lost it by the time they made their way into the lexer. But only on Arch. Tracking things down, I could see that the string was copied into the memory arena, but when it was later used, the last character was gone. I looked at my 3 line copying function, but there were no visible problems. I rewrote the function to work in a different manner, but the end result was the same. I looked up the symbols in GDB to make sure the function I thought was being called was the actual one being called. Here's the function in question, if you're following along at home:
    uchar * u_dupcstr(usubstring s, memory_arena * arena)
    {
        const size_t num_chars = s.size();
        // Reserve room for the characters plus a null terminator
        uchar * data = (uchar *) arena->alloc((num_chars + 1) * sizeof(uchar));
        memcpy(data, s.begin(), num_chars * sizeof(uchar));
        data[num_chars] = '\0';
        return data;
    }
Stepping through this function, I could see that the string was being copied correctly, and the null terminator was in the right spot. However when the returned value was printed, the last character (which I could see just fine in GDB!) was missing. Looking at the data where it was used via GDB, the last character was indeed there. wcslen (strlen for wchar_t data) however, was saying it wasn't: it was reporting the string as being one character shorter than it actually was. This is when I paid a bit more attention to the memory reported by GDB:
0x18d96ed: 0x74 0x00 0x00 0x00 0x65 0x00 0x00 0x00
0x18d96f5: 0x78 0x00 0x00 0x00 0x74 0x00 0x00 0x00
0x18d96fd: 0x2e 0x00 0x00 0x00 0x70 0x00 0x00 0x00
0x18d9705: 0x6c 0x00 0x00 0x00 0x61 0x00 0x00 0x00
0x18d970d: 0x69 0x00 0x00 0x00 0x6e 0x00 0x00 0x00
0x18d9715: 0x00 0x00 0x00 0x00
What we're looking at here is the wchar_t representation of "text.plain", with 4 little-endian bytes per character. There are 10 characters in "text.plain", so including the null terminator, it has a 44 byte representation in memory, and it's all nice and null terminated, as you can see above. Every other system has wcslen() report 10 for the above data, however on Arch Linux it returns 9.
The issue is the address of the first character, 0x18d96ed. wchar_t values have an implementation defined alignment requirement in C++, and being a 4 byte value here, that generally means they should lie on a 4 byte boundary, but 0x18d96ed is not on a 4 byte boundary (*2). The misaligned data comes from the memory arena, which is used to store a mix of UTF-8 and UTF-32 data, so will naturally end up returning unaligned memory addresses unless care is taken. In practical terms, x86 CPUs will happily load unaligned data, and you have to go out of your way to write code that doesn't handle unaligned data correctly.
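The alignment test itself is a one-liner. The sketch below uses a hypothetical helper, is_aligned, and assumes the alignment is a power of two, which is what makes the bit mask valid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if `address` is a multiple of `alignment`.
// `alignment` must be a power of two.
inline bool is_aligned(std::uintptr_t address, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (address & (alignment - 1)) == 0;
}
```

For the address above, is_aligned(0x18d96ed, 4) is false: the low two bits of 0xd are 01, so the string starts one byte past a 4 byte boundary.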
Enter Arch Linux, with its fancy new glibc 2.15, where the implementation of wcslen lives. One of the changes in glibc 2.15 is an optimised version of wcslen, that apparently doesn't like unaligned data.
Long story short, my wchar_t strings are now all properly aligned, and everyone's happy again. The moral of the story is, of course, just use UTF-8 everywhere.
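For illustration, one common way to get aligned strings out of an arena is to round the current offset up to the requested alignment before handing out memory. The sketch below is not the actual memory_arena code: aligned_arena and dup_wide are hypothetical names, and the arena is a simple fixed-capacity bump allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical fixed-capacity bump allocator, not the original memory_arena.
class aligned_arena
{
public:
    explicit aligned_arena(std::size_t capacity) : buffer_(capacity), offset_(0) {}

    // Returns `size` bytes at an address that is a multiple of `alignment`
    // (a power of two), or nullptr if the arena has no room left.
    void * alloc(std::size_t size, std::size_t alignment)
    {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::uintptr_t base = reinterpret_cast<std::uintptr_t>(buffer_.data());
        const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
        // Round the next free address up to the alignment boundary
        const std::uintptr_t aligned = (base + offset_ + mask) & ~mask;
        const std::size_t start = static_cast<std::size_t>(aligned - base);
        if (start > buffer_.size() || size > buffer_.size() - start)
            return nullptr;
        offset_ = start + size;
        return buffer_.data() + start;
    }

private:
    std::vector<unsigned char> buffer_;
    std::size_t offset_;
};

// Copies `num_chars` wide characters into the arena and null terminates them.
inline wchar_t * dup_wide(const wchar_t * src, std::size_t num_chars, aligned_arena & arena)
{
    void * mem = arena.alloc((num_chars + 1) * sizeof(wchar_t), alignof(wchar_t));
    if (mem == nullptr)
        return nullptr;
    wchar_t * data = static_cast<wchar_t *>(mem);
    std::memcpy(data, src, num_chars * sizeof(wchar_t));
    data[num_chars] = L'\0';
    return data;
}
```

UTF-8 data can still be allocated with an alignment of 1, so mixing the two kinds of string in one arena costs at most a few padding bytes per wide string.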
As to the story of ArchBang, and why it didn't reproduce the problem, glibc 2.15 only landed in Arch a few weeks ago, and the livecd dates from before then. In reality, this is a bit of luck: if I didn't know exactly which build introduced the issue, it would have been much harder to track down.
*1. Memory arenas are a technique to coalesce multiple small allocations into a single larger allocation. When you have a lot of data with the same lifetime, they provide faster allocation and deallocation, better locality of reference, and less fragmentation than just mallocing the allocations individually.
*2. When I was a young fellow, I was debugging a misbehaving program with a coworker, and marveled at his ability to tell if a hexadecimal memory address is 4-byte aligned at a glance. The trick, of course, is to just look at the last digit.