Understanding Encodings

A Guide for the Perplexed

By Matt Neuburg

Matt Neuburg is the author of REALbasic: The Definitive Guide, and a member of the editorial board of REALbasic Developer. This article was originally published in REALbasic Developer Issue 2.2 (Oct/Nov 2003).

Judging from some of the messages on the REALbasic mailing lists, many people seem to be perplexed about encodings in REALbasic 5. However, there's no need to be. Most of those asking for help seem guilty of going to one extreme or the other - either of worrying needlessly about encodings, as if they were some sort of all-pervasive and dangerous mystery to be guarded against at every instant, or else of failing to admit that they exist at all. This article is intended to help you walk a middle path, so that you can be aware of encodings without losing any sleep over them.

The Text Myth

It is perfectly natural to suffer from a delusion that there is text in your computer. Well, this may come as a shock to you, but sometimes a little shock is a good thing, so here goes: In the mind of your computer, there is really no such thing as text. The only thing your computer knows about is numbers.

The reason you are lulled into this belief that your computer knows about text is that sometimes your computer manipulates numbers in such a way that they behave like text. In order for your computer to do this, there must be a set of rules for translating between numbers and text. For example, we could decide that 1 means "a", that 2 means "b", and so forth; or we could decide that 100 means "a" and 200 means "b". It really doesn't matter what rules we decide upon, as long as we are clear on what rules we're using. It's all entirely arbitrary, rather like one of those secret codes you probably made up when you were a kid.

An encoding is simply a set of rules for performing this kind of translation. An encoding mediates between a sequence of numbers (or bytestream), on the one hand, and the visually representable textual characters (or glyphs) on the other. So you can see right away that encodings are not something to be afraid of; they're something to be grateful for. Without encodings, your computer wouldn't be able to do text at all.

So why all the fuss? There are two reasons. First, there can be any number of encodings, and historically, in fact, a very large number of them have been devised. Second, although, in the past, you may have ignored the matter of encodings, you can't do that any longer: in REALbasic 5, encodings matter.

Let's get one thing clear at the outset: The fact that in REALbasic 5 you must now be concerned with encodings is a good thing, not a bad thing. The bad thing was that encodings were rather arcane and difficult to work with in earlier versions of REALbasic! In REALbasic 5, they're easy. Also, in the past, you were probably operating on some blithe assumptions; for example, you probably just assumed, maybe without even realizing it, that every string was MacRoman. Well, you can't assume that any more; but that's good, because such an assumption was bound to cause you trouble anyway as you sally forth into the brave new worlds of Mac OS X and Windows.

What a String Really Is

Every REALbasic string now comes with two pieces of information: the bytestream of numbers representing its content, and the encoding that determines how those numbers correspond to glyphs. There are times, of course, when a string's encoding isn't of any relevance at all - for example, sometimes a sequence of numbers is just a sequence of numbers - but much of the time it is relevant, and it is at those times that you may need to be conscious of it. In particular, there are three main occasions when you might need to be careful about encodings:

(1) When text is to be displayed. This is obvious, because what's displayed is the glyphs corresponding to a string's bytestream, and the encoding is how REALbasic knows what glyphs those are. But don't worry: REALbasic is smart. It doesn't matter what encoding a string is in, just so long as it has one, and just so long as its bytestream legally represents a sequence of glyphs in accordance with that encoding; REALbasic will then do the right thing in presenting the glyphs on the screen or in print.

(2) When performing text-based transformations and operations. For example, in order to convert a string to uppercase, we have to know which letters are the minuscule letters and what majuscule letters each of these corresponds to. That is a matter of characters or glyphs, so the bytestream alone is insufficient: we also have to know the string's encoding. Indeed, even the simple notion of a string's length depends upon its encoding, because what we want is not the size of the string's bytestream but the count of its corresponding glyphs.

(3) When exchanging data with the outside world. A good example is writing to and reading from a file. A file isn't a string; it is just a sequence of numbers, and it doesn't contain any information about the meaning of those numbers, except by external convention. For instance, let's pretend there's a "Blarney" encoding. Well then, if you save a bytestream to a file as a sequence of numbers representing text in the Blarney encoding, then when you read the file again later you had better know in some other way that this sequence of numbers represents text in the Blarney encoding, and any other programs that intend to read this file had better know it too, or it won't be possible to display its data properly as glyphs, or to perform text transformations and operations upon that data.
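The split between bytestream and glyphs is easy to see outside REALbasic as well. What follows is a minimal sketch in Python rather than REALbasic; the sample word and the variable names are my own, and the codec name mac_roman is assumed to stand in for MacRoman.

```python
# A "string" modelled as its two parts: a bytestream plus the
# encoding needed to interpret it.
text = "se\u00f1or"                  # five characters; \u00f1 is the enye
stream = text.encode("utf-8")        # the bytestream that would be stored

byte_count = len(stream)                       # size of the bytestream
char_count = len(stream.decode("utf-8"))       # characters, given the encoding

# Text operations work on characters, so they need the encoding too.
shouted = stream.decode("utf-8").upper()

# Reading the same bytes under the wrong convention yields gibberish:
# the two UTF-8 bytes of the enye become two unrelated MacRoman characters.
misread = stream.decode("mac_roman")
```

The byte count and the character count differ because the enye occupies two bytes in UTF-8, which is exactly why a length function has to know the encoding.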

A Survey of Encodings

Before talking about how REALbasic lets you work with encodings, let's pause and talk about what encodings there are. The logical way to discuss this is in historical terms, because the existence of so many encodings is largely a matter of historical circumstance.

In the beginning there was ASCII, a convention for expressing glyphs as 7-bit integers (0-127). This limited the representable "alphabet" to 128 characters (fewer, actually, since some of the numbers were reserved for invisible "control characters"), but it was better than nothing. ASCII encoded the Roman alphabet, numerals, and some basic punctuation and arithmetic symbols, enough to express the code, input, and output of the programs of the day.

As time went by, ASCII was extended in the obvious way - by means of the eighth bit, thus making available a second set of 128 characters (128-255), sometimes called the "high ASCII" characters. Many high-ASCII encodings arose, taking advantage of these additional characters in different ways. The number 150 in MacRoman represents a Spanish enye; in WindowsLatin1, an en-dash; in MacCyrillic, a Russian tse. So it was now possible to express more symbols and alternate alphabets, but it was crucial to know what encoding you were using. Also, even 256 values could not possibly encompass large glyph sets such as Chinese and Japanese characters; to accommodate these, double-byte encodings were instituted, with each glyph represented by a two-byte integer (0-65535).
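The ambiguity of the number 150 can be reproduced in a few lines. This is a Python sketch, not REALbasic; it assumes that Python's codec names mac_roman, cp1252, and mac_cyrillic correspond to the MacRoman, WindowsLatin1, and MacCyrillic encodings mentioned above.

```python
import unicodedata

# One byte, three interpretations: the number 150 has no textual
# meaning until an encoding is chosen.
raw = bytes([150])

def interpret(data, codec):
    """Decode the same bytes under the named codec."""
    return data.decode(codec)

for codec in ("mac_roman", "cp1252", "mac_cyrillic"):
    # Print the official Unicode name so the output stays plain ASCII.
    print(codec, unicodedata.name(interpret(raw, codec)))
```

Running it should name a Latin n with tilde, an en dash, and a Cyrillic tse in turn: the same number, three different glyphs.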

(If you'd like to be really confused by all the variants and permutations of the various encodings, you can read more in Apple's documentation on text encodings.)

The Coming of Unicode

Unicode emerged, starting in the late 1980s, as an effort to standardize one ultimate double-byte encoding that would permit expression of every glyph in every language. 65536 characters turned out not to be enough, so certain numeric values were reserved to allow specification of additional sets of 65536 characters, called "supplementary planes"; there are 16 supplementary planes, so Unicode can theoretically express over a million characters, though only about one-tenth of these have actually been assigned values. (See http://www.unicode.org/.)
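The "over a million" figure is simple arithmetic: one basic plane plus 16 supplementary planes, each holding 65536 values. A small Python sketch of that calculation follows; the constant and function names are mine, and the count deliberately ignores the fact that some values are reserved rather than assigned to characters.

```python
import sys

PLANE_SIZE = 0x10000           # 65536 values per plane
PLANES = 1 + 16                # the basic plane plus 16 supplementary planes
TOTAL = PLANES * PLANE_SIZE    # 1114112 possible values
HIGHEST = TOTAL - 1            # hex 10FFFF, the largest Unicode value

def plane_of(code_point):
    """Return the plane number (0 for the basic plane) of a Unicode value."""
    return code_point // PLANE_SIZE

# Python's own upper limit for a character agrees with this arithmetic.
matches_python = (sys.maxunicode == HIGHEST)
```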

There are actually three Unicode encoding forms. These act like different encodings, but they are actually more like mathematical variants of one another; they are different ways of expressing the same Unicode numerical values.

UTF-32 (also called UCS4) is the simplest encoding: the numeric value of every character appears as a 4-byte integer. For example, Spanish enye is Unicode (hex) F1, so its UTF-32 representation is 000000F1. REALbasic can refer to the UCS4 encoding, but it can't work reliably with it, so you probably shouldn't try to use it; this article doesn't discuss it any further.

UTF-16 is a little more complicated: the numeric value of every character appears as a 2-byte integer, unless the character is from one of the supplementary planes; in that case, the character appears as two 2-byte values (a "surrogate pair") from the special range D800-DFFF. There are rules for combining the two: each value contributes ten bits, and the resulting 20-bit number, added to hex 10000, is the character's numeric value. Just to make life more complicated, UTF-16 files come in big-endian and little-endian flavors; these are distinguished by an initial byte-order mark (BOM), which appears as the bytes FEFF (big-endian) or FFFE (little-endian). The UTF-16 representation of Spanish enye is 00F1.
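For the curious, the rule for supplementary-plane characters can be sketched as follows. This is Python, not REALbasic, and the function names are my own; the sample value hex 1D11E (a musical clef symbol) is simply a convenient character from outside the basic plane, and the sketch does not validate its input.

```python
import codecs

def utf16_units(code_point):
    """Return the list of 16-bit UTF-16 values for one Unicode value."""
    if code_point < 0x10000:
        return [code_point]                 # one 2-byte value, as is
    offset = code_point - 0x10000           # at most 20 bits remain
    high = 0xD800 + (offset >> 10)          # top ten bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom ten bits
    return [high, low]

def utf16_be_hex(code_point):
    """Big-endian UTF-16 written out as hex digits."""
    return "".join(format(unit, "04X") for unit in utf16_units(code_point))

enye = utf16_be_hex(0xF1)        # "00F1"
clef = utf16_be_hex(0x1D11E)     # "D834DD1E", a pair from the D800-DFFF range

# The two byte-order marks, as Python's codecs module defines them.
bom_big, bom_little = codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE
```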

UTF-8 is still more complicated, but it's also very ingeniously designed to be compact and easily machine-parsable. In contrast to UTF-16, where every string is at least twice as long as its ASCII counterpart, in UTF-8 the 128 ASCII values appear as themselves, a single byte, so that they take up no extra space; ASCII strings are completely interchangeable with UTF-8 strings consisting of just ASCII values. Other characters are represented by two, three, or four bytes, in accordance with a clever mathematical formula such that (1) you know instantly from the first byte of a UTF-8 character how many bytes it consists of, and (2) if you start with a byte in the middle of a UTF-8 string, you know instantly whether this is the first byte of a character and, if not, you can easily find the first byte. The UTF-8 representation of Spanish enye is C3B1, but note that that's just a way of writing bytes; it still represents (by way of the mathematical formula) the numeric value F1.
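Here is a sketch of that formula in Python (not REALbasic), for readers who like to see the bits. The function names are mine, and the code assumes it is handed valid values; it makes no attempt to reject ill-formed input.

```python
def utf8_encode(code_point):
    """Encode one Unicode value as UTF-8 bytes by hand."""
    if code_point < 0x80:
        return bytes([code_point])                      # ASCII: itself
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

def sequence_length(first_byte):
    """Property (1): the first byte says how long the character is."""
    if first_byte < 0x80:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not the first byte of a character")

def is_continuation(byte):
    """Property (2): non-initial bytes always look like 10xxxxxx."""
    return byte >> 6 == 0b10

def start_of_character(data, index):
    """Back up from any byte index to the first byte of its character."""
    while is_continuation(data[index]):
        index -= 1
    return index
```

For example, utf8_encode(0xF1) yields the two bytes C3 and B1, and start_of_character on the byte B1 backs up one position to C3.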

REALbasic's Encoding Tools

The old TextConverter class should no longer be necessary for most purposes. Instead, start with the Encodings global object, where each standard encoding is a property - encodings.macroman, encodings.ascii, encodings.utf8, and so on. To ask whether a string has an encoding, test its encoding property against nil:

if s.encoding = nil then // ...

A string's encoding property is a TextEncoding instance, and so are the properties of the Encodings class. Since these are separate instances, you can't use ordinary equality to compare them. (It beats me why operator overloading has not been implemented to permit this; after all, this is exactly the sort of thing that operator overloading is for.) Instead, you must use the equals() method. So:

if s.encoding = encodings.ascii then // ... wrong
if s.encoding.equals(encodings.ascii) then // ... right

If a string has no encoding, but you happen to know what encoding its bytestream is supposed to represent, you can attach that information to a copy of the string using the defineEncoding method:

if s.encoding = nil then
  s = s.defineEncoding(encodings.ascii)
end if

Naturally, you shouldn't lie to REALbasic. ("Do not taunt happy fun ball!") If you use defineEncoding to claim that a certain bytestream works in a certain encoding and it doesn't, you might end up with gibberish when you try to display or operate on the string.

If a string has an encoding and you want a string representing the same glyphs by way of a different encoding, use the convertEncoding method:

s = s.convertEncoding(encodings.utf8)

Once again, you should be careful; don't convert illegitimately. MacRoman's characters are a superset of ASCII's, and UTF-8's characters are a superset of MacRoman's, so you can convert from ASCII to MacRoman and from either of those to UTF-8; but if you convert in the other direction, the string you start with might contain a character that can't be represented in the encoding you're converting to - and in that case, the results can be unpredictable.
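The difference between defining and converting may be clearer in a toy model. The Python class below is purely illustrative and is not how REALbasic implements strings; the class and method names are invented for this sketch, and raising an error on an impossible conversion is Python's behavior, not a claim about what REALbasic does.

```python
class EncodedString:
    """Toy model: a bytestream plus an optional encoding label."""

    def __init__(self, data, encoding=None):
        self.data = data
        self.encoding = encoding

    def define_encoding(self, encoding):
        # Same bytes, new label: nothing is checked or changed.
        return EncodedString(self.data, encoding)

    def convert_encoding(self, encoding):
        # Same characters, new bytes. Python raises UnicodeEncodeError
        # if a character has no representation in the target encoding.
        if self.encoding is None:
            raise ValueError("cannot convert a string with no encoding")
        text = self.data.decode(self.encoding)
        return EncodedString(text.encode(encoding), encoding)

mac = EncodedString(bytes([0x96]), "mac_roman")   # enye in MacRoman
utf8 = mac.convert_encoding("utf-8")              # bytes become C3 B1
mislabeled = mac.define_encoding("cp1252")        # bytes unchanged; now reads as an en-dash
```

Defining leaves the bytes alone and changes what they mean; converting keeps the meaning and changes the bytes. Converting mac to ASCII fails in this model because enye has no ASCII representation.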

Characters and Literals

If you use the plain old chr() function, REALbasic assumes you want ASCII, so you'd better pass a value less than 128 or you'll end up with a nil encoding. On the whole you should probably stop using this function, and use instead the equivalent TextEncoding chr() method. For example, to make a Spanish enye, you can say this:

s = encodings.utf8.chr(&hF1)

We start with the value F1 because that's the Unicode numeric equivalent of enye. If, for some reason, you wanted to compose a Spanish enye using its UTF-8 bytestream representation, you could also do that:

s = chrb(&hC3) + chrb(&hB1)
s = s.defineEncoding(encodings.utf8)

Literal strings, including strings defined by quotation marks and as constants in modules, are UTF-8. Mac OS X Unicode input methods, including the Character Palette, work in the editor, and REALbasic project files are saved as UTF-8, so everything works coherently. Opening a REALbasic 4.5 project in REALbasic 5 works coherently too, because literals are converted to their UTF-8 equivalent; but if you open a REALbasic 5 project in REALbasic 4.5, your string literals can be destroyed.

Concatenated strings take on the "larger" encoding, if there is one, or UTF-8 if not. So, for example, if you concatenate an ASCII string with a MacRoman string the result is a MacRoman string, but if you concatenate a MacRoman string with a WindowsLatin1 string the result is a UTF-8 string, since neither MacRoman nor WindowsLatin1 is a superset of the other.
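As a rough illustration of that rule, here is a Python sketch. The superset table is hypothetical and covers only the encodings named above; it models the rule as this article describes it and is not taken from REALbasic's internals.

```python
# For each encoding, the encodings able to represent all of its characters.
# Hypothetical table, limited to the encodings discussed in the text.
SUPERSETS = {
    "ascii": {"ascii", "mac_roman", "cp1252", "utf-8"},
    "mac_roman": {"mac_roman", "utf-8"},
    "cp1252": {"cp1252", "utf-8"},
    "utf-8": {"utf-8"},
}

def concatenated_encoding(a, b):
    """Pick the encoding of a + b: the larger one, else UTF-8."""
    if b in SUPERSETS[a]:
        return b            # b can represent everything a can
    if a in SUPERSETS[b]:
        return a            # a can represent everything b can
    return "utf-8"          # neither contains the other

def concatenate(data_a, enc_a, data_b, enc_b):
    """Join two (bytes, encoding) pairs, re-encoding into the result encoding."""
    target = concatenated_encoding(enc_a, enc_b)
    text = data_a.decode(enc_a) + data_b.decode(enc_b)
    return text.encode(target), target
```

Joining the MacRoman byte 150 with the WindowsLatin1 byte 150 in this model gives a UTF-8 result containing an enye followed by an en-dash, five bytes in all.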

Text Functions

Text functions such as len(), mid(), asc() and so forth continue to work just fine, provided a string is well-formed - that is, provided it is marked as having some encoding, and provided this encoding matches its bytestream.

If you want to operate on the bytestream directly, use lenb(), midb(), ascb(), and so forth. Functions that yield strings, such as midb() and leftb(), will yield a string with an encoding attached; but that doesn't mean the string is valid in that encoding, and in any case this shouldn't matter to you because presumably all you're interested in is the bytestream. The following code illustrates:

dim s, b as string
dim i, u as integer
s = "ñ"
u = s.lenb // size of the bytestream, in bytes
for i = 1 to u
  b = b + hex(s.midb(i,1).ascb) // append each byte as hex
next
msgbox b // C3B1, its UTF-8 encoding
msgbox hex(s.mid(1,1).asc) // F1, its Unicode value

Regex

The RegexOptions utf8 property no longer exists. Instead, a Regex object will set itself automatically to use UTF-8 if necessary, and string encodings will be converted transparently as required. In particular, the resulting RegexMatch will return subexpressionString values that are properly encoded. This means you can find out whether a search was performed using UTF-8 by testing whether the resulting RegexMatch's subexpressionString is in the UTF-8 encoding.

The main tricky part here is that RegexMatch's subexpressionStartB, which has replaced the old subexpressionStart, reports position in terms of 0-based bytes, without regard to encoding, and not in terms of 1-based characters. This means that if you want a 1-based character position you must derive it yourself. For example:

dim rx as regex, rxm as regexmatch
dim s as string
s = "ñññho"
rx = new regex
rx.searchPattern = "ho"
rxm = rx.search(s)
dim bPos, cPos as integer
bPos = rxm.subexpressionStartB(0)
cPos = s.leftb(bPos).len + 1 // translate
// now let's prove that it works
dim cLen as integer, temp as string
cLen = rxm.subexpressionString(0).len
temp = s.mid(cPos,cLen)
msgbox "Found " + temp + " at position " + str(cPos)

In that example, the RegexMatch subexpressionStartB(0) value is 6, because each of the three preceding "ñ" characters occupies two UTF-8 bytes and 0-based counting is used; we translate that to 4, describing the start of the string "ho" in "ñññho" in 1-based character-position terms.
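The same translation from a 0-based byte offset to a 1-based character position can be checked independently. The following Python sketch mirrors the arithmetic of the example above; it uses a plain byte search rather than a regular expression, and the function name is my own.

```python
def char_position(text, needle):
    """Return (0-based byte offset, 1-based character position) of needle."""
    data = text.encode("utf-8")
    byte_pos = data.find(needle.encode("utf-8"))
    if byte_pos < 0:
        return None
    # Count the characters in the bytes that precede the match, then add 1.
    char_pos = len(data[:byte_pos].decode("utf-8")) + 1
    return byte_pos, char_pos

# Three enyes (two bytes each) followed by "ho".
result = char_position("\u00f1\u00f1\u00f1ho", "ho")
```

Here result is the pair (6, 4), agreeing with the REALbasic example.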

Interface Items

Interface items that display text, such as PushButtons, StaticTexts, and EditFields, work somewhat the same way as string concatenation: depending on the circumstances, they may convert text that is handed to them to some other encoding. Therefore you should make no assumptions about the encoding of an interface item's text based on what you put into it; when you extract the text later, it may turn out to have some other encoding. An EditField's text will almost certainly turn out to be UTF-8, which makes sense because the user can type into an EditField, which thus needs to be able to accept any Unicode character.

Text that must be shared with the system is rather like an EditField; the user won't interact with it, but the system will, so it needs to be in a flexible form that the system can deal with. Thus, text placed on the clipboard or put into a DragItem is converted to UTF-16; file and folder names are converted to UTF-8.

Files

When you write a string using a TextOutputStream, you write the string's bytestream. How that bytestream relates to the visible representation of the string depends, of course, on its encoding, which is a completely separate matter. When you read those bytes back in again, you probably want to reconstruct the original string, meaning both the bytestream and its encoding. To help you do this, the TextInputStream class now has an encoding property; you set this property before doing any reading, and now any strings created with Read and ReadLine will have that encoding. But it is up to you to know, in some other way, what this encoding is supposed to be.

By default, a TextInputStream's encoding is UTF-8. This means that if you've been saving MacRoman strings to files in some earlier version of REALbasic, when you come to read those files using REALbasic 5 the strings may yield gibberish unless you explicitly set your TextInputStream encodings to MacRoman. This is an opportunity for breakage of your existing code as you migrate to REALbasic 5; be sure to search your code for TextInputStreams and set their encodings explicitly.
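The kind of breakage described here is not peculiar to REALbasic, and it can be demonstrated with any language that reads files under an assumed encoding. The Python sketch below writes a MacRoman file and reads it back twice; the file name legacy.txt and the helper names are hypothetical, and the replacement character that marks the damage is Python's choice, not REALbasic's.

```python
import os
import tempfile

def write_bytes(path, data):
    """Store a raw bytestream; no encoding information goes with it."""
    with open(path, "wb") as f:
        f.write(data)

def read_text(path, encoding):
    # errors="replace" substitutes U+FFFD instead of raising, so the
    # damage from a wrong guess is visible in the result.
    with open(path, "r", encoding=encoding, errors="replace", newline="") as f:
        return f.read()

path = os.path.join(tempfile.mkdtemp(), "legacy.txt")
write_bytes(path, "ma\u00f1ana".encode("mac_roman"))   # an "old" MacRoman file

wrong = read_text(path, "utf-8")        # assumed UTF-8: the enye is damaged
right = read_text(path, "mac_roman")    # explicit MacRoman: original text
```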

Similar considerations apply to databases. When you save text into a database, you're saving the bytestream; encoding information is lost. It's up to you, when reading the text out of the database later on, to know what encoding it's in, and to use defineEncoding to tell REALbasic about it.

GetFolderItem has long been deprecated as a means of specifying a FolderItem by its pathname, and now there's yet another reason for not misusing it in this way: GetFolderItem breaks if a pathname contains non-ASCII characters. It can still be used, however, with just a filename, as a way of specifying a FolderItem in the same folder as your application.

Styled EditFields

It appears that there is no way to save and restore styled Unicode text. This means that in the general case you cannot use REALbasic to write a TextEdit-like application. Styled EditFields can still be saved and restored, though, in the limited case where (1) all the characters fall within an old, pre-Unicode encoding, and (2) the EditField's TextFont corresponds to that encoding, in the sense that getFontTextEncoding(theEditField.textFont) yields that encoding. This isn't much comfort, since if the EditField is editable the user could conceivably enter any character into it, including a character that lies outside the EditField's TextFont encoding; we would then be back to the general case where saving and restoring doesn't work.

The reason for this breakage is that REALbasic provides no access to the system's means of describing Unicode styled text information ('utxt'/'ustl' and RTF); it accesses only the old ways of describing styled text, 'TEXT' and 'styl'. The 'styl' is what you access as an EditField's TextStyleData, and is saved and restored by FolderItem's SaveStyledEditField and OpenStyledEditField methods. But 'styl' doesn't work with Unicode; it works only with the older single-byte or WorldScript double-byte encodings. Since an EditField's text is Unicode, it has to be converted to one of those encodings in order for the 'styl' to match it, using a formula like this:

s = theEditField.text.convertEncoding(getFontTextEncoding(theEditField.textFont))

That is what SaveStyledEditField does, and it is what you would need to do if you wanted to extract the EditField's text and TextStyleData separately. But if that encoding conversion is impossible because the text contains characters that lie outside the target encoding, those characters will be lost.
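To get a feel for what "lost" means, here is a Python sketch of a lossy conversion to a pre-Unicode encoding. It is an analogy only: the function names are invented, mac_roman stands in for whatever encoding the font implies, and the question mark that replaces each unconvertible character is Python's convention, which may differ from what REALbasic produces.

```python
def encodable(ch, codec):
    """True if the single character ch can be represented in codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

def to_font_encoding(text, codec):
    """Convert text to a legacy codec, reporting what could not survive."""
    lost = [ch for ch in text if not encodable(ch, codec)]
    data = text.encode(codec, errors="replace")   # unconvertible -> "?"
    return data, lost

# "a", an enye, and one Chinese character (U+4E2D).
data, lost = to_font_encoding("a\u00f1\u4e2d", "mac_roman")
```

The enye survives as the MacRoman byte 150, while the Chinese character has no MacRoman equivalent and is replaced, so the original text cannot be recovered from the result.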