Infocard Storage in DLLs (or "What's a BOM?")

cshake

After quite a bit of reading through the hex of the dll files for my various PHP-based infocard tools, I’ve found a little issue with nearly all non-vanilla infocards: they have a period at the start of the string, before the opening of the <?xml tag. After comparing them to vanilla cards, doing a bit of research online, and talking to people on the Disco dev team, I feel I should share this little tidbit with everyone (especially whoever wrote FL-ids and/or FLDev).

We’ll start by looking at the raw hex code of a vanilla card:
The resource index, language, length, and offset index of the start of the card are all coded in the dll (as the built-in dll tools will do for you), so we’ll just look at the start of the string itself, starting at the indicated offset.

```
FFFE 3C00 3F00 7800 6D00 6C00 2000 7600
6500 7200 7300 6900 6F00 6E00 3D00 2200
3100 2E00 3000 2200 2000 6500 6E00 6300
6F00 6400 6900 6E00 6700 3D00 2200 5500
5400 4600 2D00 3100 3600 2200 3F00 3E00
```

which, when read as ISO-8859-1 text, looks like this (with the \x00 "Null" characters removed):

> ÿþ<?xml version="1.0" encoding="UTF-16"?>
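
The ISO-8859-1 reading above can be reproduced with a few lines of Python. This is only a sketch for checking the dump by hand, not one of the tools mentioned in the thread; the variable names are made up for illustration.

```python
# Hex dump of the start of the vanilla card, copied from the post above.
dump = (
    "FFFE 3C00 3F00 7800 6D00 6C00 2000 7600 "
    "6500 7200 7300 6900 6F00 6E00 3D00 2200 "
    "3100 2E00 3000 2200 2000 6500 6E00 6300 "
    "6F00 6400 6900 6E00 6700 3D00 2200 5500 "
    "5400 4600 2D00 3100 3600 2200 3F00 3E00"
)
raw = bytes.fromhex(dump.replace(" ", ""))

# ISO-8859-1 maps each byte to one character; drop the NULs as the post does.
as_latin1 = raw.decode("iso-8859-1").replace("\x00", "")

# Python's "utf-16" codec reads the BOM, picks the byte order, and removes it.
as_utf16 = raw.decode("utf-16")

print(repr(as_latin1))
print(repr(as_utf16))
```

The first line printed keeps the two "junk" characters in front of the XML declaration; the second is the declaration alone, because the decoder treated FF FE as a byte order mark.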

Now, a card from Discovery.dll:

```
2E00 3C00 3F00 7800 6D00 6C00 2000 7600
6500 7200 7300 6900 6F00 6E00 3D00 2200
3100 2E00 3000 2200 2000 6500 6E00 6300
6F00 6400 6900 6E00 6700 3D00 2200 5500
5400 4600 2D00 3100 3600 2200 3F00 3E00
```

> .<?xml version="1.0" encoding="UTF-16"?>

You can see the difference - the first two bytes are changed from "FFFE" to "2E00", where \xFF and \xFE alone look like junk characters and \x2E is a period. Why was this done?
From talking to the Disco infocard team, the very simple answer is "We stripped the junk characters because they don't work when posted to a forum, but if there was nothing there Freelancer ate up the < and all the rest of the infocard was output as a text string and you saw XML everywhere, so we just added something to fix it."

The problem with this is that while it does technically work, that's only because Freelancer itself is somewhat fault tolerant: when it finds invalid data it just assumes the default character encoding. Also, XML strings are not allowed to have text outside the tags, so ".<?xml version="1.0" encoding="UTF-16"?>" is not valid XML.
Now, all this and I have yet to say what the \xFFFE actually is, so here it is:

> A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol.
> (from http://www.unicode.org/faq/utf_bom.html#BOM)

Without going too far into the whole text encoding thing, please see http://en.wikipedia.org/wiki/Byte-order_mark#UTF-16

Basically, Freelancer requires XML text stored in the HTML resource type of DLLs to have a byte order mark at the beginning of the string so it knows the "endian-ness" of the string. When the correct mark is not available (i.e. when a period is substituted), it seems to fall back gracefully to assuming UTF-16LE, which is what all the vanilla cards use.
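
For anyone writing a tool that reads these resources, the logic described above might look roughly like the following. This is a sketch of what a reader could do, not a description of Freelancer's own code; `decode_infocard` is a hypothetical name, and the fallback to UTF-16LE is the assumption stated in the post.

```python
import codecs


def decode_infocard(raw: bytes) -> str:
    """Decode a raw infocard resource to text (hypothetical helper)."""
    # A BOM tells us the byte order; strip it and decode accordingly.
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[2:].decode("utf-16-le")
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[2:].decode("utf-16-be")
    # No BOM: assume UTF-16LE, the byte order the vanilla cards use.
    text = raw.decode("utf-16-le")
    # Drop the placeholder period that some tools write before the XML.
    if text.startswith(".<"):
        text = text[1:]
    return text


xml = '<?xml version="1.0" encoding="UTF-16"?>'
vanilla = codecs.BOM_UTF16_LE + xml.encode("utf-16-le")
dotted = ("." + xml).encode("utf-16-le")
print(decode_infocard(vanilla) == decode_infocard(dotted))  # True
```

Both the vanilla-style card and the period-prefixed card come out as the same clean XML string, which is the behaviour a tolerant reader would want.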

So, the point of this post: FLDev, FL-ids, and anyone else who has a program that writes dlls for Freelancer, would you please have your tools add the Byte Order Mark to the beginning of the resource instead of just writing UTF-16LE without it? For legacy and compatibility reasons it might help to detect a period before the <?xml tag.
-Chris
(Thanks to Sovereign for listening to this over Skype and then saying "well, how come nobody else knows about it?" and reminding me to post here.)
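
As a rough sketch of what the request above could mean on the writing side: prepend the UTF-16LE BOM when encoding, and swap a legacy leading period for a BOM when one is found. Both function names are hypothetical and are not taken from FLDev, FL-ids, or any other existing tool.

```python
import codecs


def encode_infocard(xml_text: str) -> bytes:
    """Return BOM + UTF-16LE bytes for an infocard (hypothetical helper)."""
    return codecs.BOM_UTF16_LE + xml_text.encode("utf-16-le")


def upgrade_legacy(raw: bytes) -> bytes:
    """Replace a leading period (2E 00) in front of '<' with a BOM."""
    period_then_lt = ".<".encode("utf-16-le")  # bytes 2E 00 3C 00
    if raw.startswith(period_then_lt):
        return codecs.BOM_UTF16_LE + raw[2:]
    return raw  # already has a BOM, or is something we do not recognise


legacy = '.<?xml version="1.0" encoding="UTF-16"?>'.encode("utf-16-le")
print(upgrade_legacy(legacy)[:4].hex())  # fffe3c00
```

Because the period and the BOM are both exactly two bytes, the swap leaves the resource length unchanged.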

StarTrader

You are absolutely correct. Why have standards written by real “boffins” if we don’t follow them?

adoxa

Actually, Freelancer does two things here. In one case, it will test for the BOM and, if present, skip it. In another case, it will simply skip the first two bytes, assuming the BOM is present. The JFLP resources include it where it is needed (the rumors), but leave it out otherwise. The XML signature is also not needed, so that’s left out, too. Of course, I expect the original DLLs were created from actual .xml files, where it makes sense to include them. Within the DLL, though, I just save some space.
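
The two behaviours described above can be imitated in a few lines to see why a card with nothing in front of it loses its first character. This is only an illustration of the description, assuming UTF-16LE data; it is not Freelancer's actual code and the function names are invented.

```python
def skip_bom_if_present(data: bytes) -> bytes:
    # Case 1: test for the BOM and skip it only when it is there.
    return data[2:] if data[:2] == b"\xff\xfe" else data


def skip_first_two_bytes(data: bytes) -> bytes:
    # Case 2: assume a BOM is present and drop two bytes unconditionally.
    return data[2:]


xml = '<?xml version="1.0" encoding="UTF-16"?>'
bare = xml.encode("utf-16-le")            # nothing in front
dotted = ("." + xml).encode("utf-16-le")  # period in front
marked = b"\xff\xfe" + bare               # BOM in front

for name, data in (("bare", bare), ("dotted", dotted), ("marked", marked)):
    print(name, repr(skip_first_two_bytes(data).decode("utf-16-le")[:6]))
```

In the unconditional case the bare card comes out starting with `?xml`, with the `<` eaten, while the dotted and marked cards both come out intact. That would explain why any two-byte placeholder appears to work there.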

BTW, they are not NUL characters, they’re the high bytes of a single UTF-16 character. It should really look like FEFF 003C 003F … You got away with it in these, but there are a few actual Unicode characters used, which you’d miss. FRC takes care of it all.
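
To make the point about high bytes concrete, here is a small sketch that reads the first bytes of the vanilla card as little-endian 16-bit units, then shows what happens to a character whose high byte is not zero when it is read byte by byte. U+2019 (a curly apostrophe) is chosen purely as an example; the thread does not say which characters the vanilla cards actually contain.

```python
import struct

# First six bytes of the vanilla card from the dump earlier in the thread.
raw = bytes.fromhex("FFFE3C003F00")

# "<" means little-endian, "H" is an unsigned 16-bit unit.
units = struct.unpack(f"<{len(raw) // 2}H", raw)
as_units = " ".join(f"{u:04X}" for u in units)
print(as_units)  # FEFF 003C 003F

# A character with a non-zero high byte: bytes 19 20 in UTF-16LE.
quote = "\u2019".encode("utf-16-le")
naive = quote.decode("iso-8859-1").replace("\x00", "")
proper = quote.decode("utf-16-le")
print(repr(naive), repr(proper))
```

The byte-wise reading turns the apostrophe into a control character followed by a space, which is the kind of loss that stripping "NULs" hides for plain ASCII text.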

cshake

> BTW, they are not NUL characters, they’re the high bytes of a single UTF-16 character. It should really look like FEFF 003C 003F … You got away with it in these, but there are a few actual Unicode characters used, which you’d miss. FRC takes care of it all.

Right, they are the high bytes when interpreted as UTF-16. I’m fully aware of what little-endian means here. However, when the string is interpreted as ISO-8859-1 (which some earlier programs do, especially non-Freelancer-specific resource viewers), they are NUL. I guess I didn’t make that clear in my first post: they are useful parts of the string when you read it correctly.

There are a total of 5 or 6 characters in the entirety of the vanilla xml resources that use the high bytes, mainly squiggly quotes and accented letters. The PHP scripts I wrote to dump the resources to a SQL database correctly handle real UTF-16LE, but I store it all internally as UTF-8 because it takes half the space for 99.999% of the data, is able to handle every character used in any resource I’ve ever seen, and uses a second byte only when needed.
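
The space trade-off described above is easy to check. This sketch compares the encoded size of an ASCII-only string with two non-ASCII characters; the sample characters are assumptions picked for illustration, not ones confirmed to appear in the resources.

```python
samples = {
    "ascii": '<?xml version="1.0" encoding="UTF-16"?>',
    "accented": "\u00e9",     # e with acute accent
    "curly quote": "\u2019",  # right single quotation mark
}

# Map each sample to (UTF-8 byte count, UTF-16LE byte count).
sizes = {
    name: (len(text.encode("utf-8")), len(text.encode("utf-16-le")))
    for name, text in samples.items()
}

for name, (utf8_len, utf16_len) in sizes.items():
    print(f"{name}: utf-8={utf8_len} bytes, utf-16le={utf16_len} bytes")

# Converting between the two encodings loses nothing.
for text in samples.values():
    assert text.encode("utf-16-le").decode("utf-16-le").encode("utf-8").decode("utf-8") == text
```

The ASCII string is 39 bytes in UTF-8 against 78 in UTF-16LE. The accented letter takes two bytes either way, and the curly quote takes three bytes in UTF-8, so the saving comes entirely from the ASCII bulk of the data.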