Using Unicode with Python

Computers really work only with numbers: binary numbers. But to communicate with people, computers have had some ability to input and output words almost from day one.

The trick has always been to provide a mapping of numbers to characters. There were several standards, including IBM's EBCDIC, but the ASCII character set emerged as the clear winner in the race for a standard by the late 1970s. It provides for 128 characters, of which 95 are printable. These include most punctuation characters, the decimal digits, and both the upper and lower case alphabet. In addition, the low values from 0 to 31 are used for control characters such as tab and newline.

Most computers use an 8-bit byte to store a character (or, more accurately, the number representing the character), but there have been exceptions. The DEC PDP-10, a timesharing computer we used at the University of Oregon in the '70s, had a word size of 36 bits and would store five 7-bit characters per word with a bit left over.

If 8-bit bytes are used to store characters, ASCII's 7-bit code leaves 1 bit unused, effectively doubling the number of possible characters. And there are lots of candidates. Most European languages need several letters beyond those provided in the basic ASCII set, and some, like Greek and Russian, need entire alphabets. Individually, many of these needs can be accommodated, but not all at the same time. This led to a variety of mutually incompatible extensions to ASCII. In a bit we will write some Python code that generates a web page to illustrate this.

Octal and Hexadecimal numbers

Computers actually do everything in binary numbers, but reading binary numbers is hard on people. For example, decimal 133 is "10000101" in binary (the three '1' bits adding up to 128+4+1=133), but one would hardly know it to look at it. Humans process information better in fewer but bigger chunks. Octal and hexadecimal numbers are a convenient compromise: we get smaller, chunkier numbers that still let us see the binary pattern in the number.
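These conversions are easy to check in Python (modern Python 3 syntax shown):

```
# Decimal 133 in binary, octal, and hexadecimal.
n = 133
print(bin(n))   # 0b10000101  (the bits 128 + 4 + 1)
print(oct(n))   # 0o205
print(hex(n))   # 0x85
```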

Octal (base 8) uses the digits 0-7 and is transformed directly from binary 3 bits at a time. Octal became popular because several early computers with a 16-bit word size used 3-bit fields within a machine instruction. In the PDP-11, for example, 0001000101000011 moves the contents of register 5 to register 3. If this number is first chunked into 0 001 000 101 000 011 it becomes 010503 in octal: the field 01 means move, and 05 and 03 are the registers. In hexadecimal the same number chunks into 0001 0001 0100 0011 and is represented by 1143. That's not as useful.
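Python can do the chunking for us (Python 3 syntax):

```
# The 16-bit PDP-11 instruction from the text, viewed both ways.
word = 0b0001000101000011
print(oct(word))   # 0o10503 -> fields 01 (move), 05 and 03 (registers)
print(hex(word))   # 0x1143  -> the field boundaries are lost
```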

Other computers use 4-bit chunks for instruction fields. The DEC VAX-11, which over time largely replaced the PDP-11, has a 32-bit word size and 16 registers instead of the PDP-11's 8, so hexadecimal is a natural choice for representing VAX instructions. Hexadecimal needs 16 digits and uses 0-9 plus A-F (either case).

Both octal and hexadecimal may be used to represent 1-byte characters. In Python any character, especially a non-printing character, may be written in a string as its 3-digit octal value following a '\', or as its 2-digit hexadecimal value following a '\x'. Python displays non-printing characters in a string in either format, with newer versions of Python preferring hexadecimal and older versions octal.

```>>> a = '\x41bc'
>>> a
'Abc'
>>> '\x10bc'
'\x10bc'
>>>
```

For character data I find hexadecimal better. Exactly 2 digits are needed for a byte, and if a number occupies multiple bytes the digits break evenly on the byte boundaries. We will stick with hexadecimal in this tutorial.

A look at some encodings

Different encodings are just different mappings of numbers to characters. Many go by "ISO-8859-x" where "x" is 1 to 15 or so. These encodings generally also have a more common name, such as "Latin-1", "Cyrillic", or "Nordic".

The following program, table.py, creates a small web page with a table. The table is 16x16 cells, each containing a character and its corresponding hexadecimal value. Load table.html into your browser, whether Netscape, Mozilla, or Internet Explorer. (Warning: Opera has problems with foreign character sets.) Next go to the View menu and choose "Character Encoding" in Netscape, "Encoding" in Internet Explorer, or "Character Coding" in Mozilla. Select various 8-bit sets such as ISO-8859-1, ISO-8859-15, Cyrillic (Windows), Greek, etc. Don't worry about UTF-8 or UTF-16 if you see them on the menu; just ignore them for now.
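The original table.py is not reproduced here, but a minimal sketch of such a generator (in modern Python 3; the details of the original layout are assumptions) might look like this:

```
# Sketch of a table.py-style generator. It writes raw bytes, one per cell,
# and deliberately omits any charset declaration so the browser's
# encoding menu decides how each value 0x00-0xff is displayed.
rows = []
for hi in range(16):
    cells = []
    for lo in range(16):
        n = hi * 16 + lo
        # Show the hex value, then the byte itself (skip control codes).
        cells.append(b'<td>%02x %s</td>' % (n, bytes([n]) if n >= 0x20 else b''))
    rows.append(b'<tr>' + b''.join(cells) + b'</tr>')

page = b'<html><body><table border="1">' + b''.join(rows) + b'</table></body></html>'
with open('table.html', 'wb') as f:
    f.write(page)
```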

As you select alternate encodings, notice how the first half of the table stays the same but the second half (hex 80-ff) changes quite a bit from one encoding to another. For example, you may find that hexadecimal "CF" is the Russian letter Я in ISO-8859-5 ("Cyrillic") but something else in another encoding.

With these 8-bit encodings it is possible to mix English with one or more other European languages, but not languages from two different encodings, such as Greek and Russian, together. It is also difficult for software to keep track of which encoding is in use in a particular document.

An entirely different problem comes up with languages like Chinese and Japanese, which use thousands of characters within a single language.

Enter Unicode

There is a simple and obvious way around all of this: stop trying to cram all of the characters in all of the world's languages into 8 bits. Unicode assigns a unique number to every character in every alphabet; the original design used 16-bit values, yielding 65536 possible characters. With Unicode you can have Russian, Greek, and any other language all in the same document with no confusion.
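In modern Python 3, where every string is a Unicode string, mixing alphabets is effortless:

```
# One string mixing Latin, Greek, and Cyrillic characters,
# and the Unicode value of each.
s = 'AΩЯ'
print([hex(ord(c)) for c in s])   # ['0x41', '0x3a9', '0x42f']
```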

For convenience Unicode uses the values 00-7f for the same characters as ASCII. In addition, the character values 80-ff match those in the ISO-8859-1 (Latin-1) encoding.
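This overlap is easy to verify (Python 3 syntax, where decoding is a method on byte strings):

```
# Every byte value 0-255, decoded as Latin-1, lands on the
# Unicode code point with the same number.
for n in range(256):
    assert ord(bytes([n]).decode('iso-8859-1')) == n
print('latin-1 bytes match Unicode code points 0-255')
```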

You can find out everything about Unicode at the Unicode Consortium's web site, unicode.org. Click on "Charts" to see whole sets of encodings.

Unicode and Python

Python supports Unicode strings, whose individual characters are 16 bits wide. A Unicode string literal is preceded with a 'u'. For example

```>>> a = u'abcd'
>>> print a
abcd
```

Characters in a Unicode string may be simple ASCII characters, which map to Unicode characters with the same value, or they may be extended characters. Any extended character (one whose value is greater than 0x7F) may be written with the escape '\uxxxx', where xxxx is a 4-digit hexadecimal number. Python will display it in the same format. For example

```>>> a = u'abc\uabcd'
>>> a
u'abc\uabcd'
>>> len(a)
4
>>> a[2]
u'c'
>>> a[3]
u'\uabcd'
>>>
```

The 4-digit hexadecimal value above represents a single character. Unfortunately we can't see directly from Python's interactive mode what character in what language it is. However, we'll see shortly how to view it in a web browser.

There are some Python built-in functions for Unicode. The function `unicode` converts 8-bit encodings into Unicode. It takes 2 arguments: a normal string of 8-bit characters and a string naming the encoding. For example

```>>> a = 'abc\x81'
>>> unicode(a, 'iso-8859-1')
u'abc\x81'
>>>
```

The function converts a string of 8-bit characters to one of 16-bit characters and returns the corresponding Unicode string. If the encoding is 'ascii', the function will complain if any character value is above hex 7F.

```>>> unicode(a, 'ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeError: ASCII decoding error: ordinal not in range(128)
```

Now consider the following

```>>> unicode('abc\xe4', 'iso-8859-1')
u'abc\xe4'
>>> unicode('abc\xe4', 'iso-8859-5')
u'abc\u0444'
```

In the ISO-8859-1 (Western European) encoding the 8-bit character \xe4 is mapped to the same value in Unicode, which represents the character "ä", but in ISO-8859-5 \xe4 is mapped to Unicode \u0444, which represents the Russian character "ф". The second argument to the unicode function may be omitted if the encoding is 'ascii'.
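In Python 3 the `unicode` function is gone; the same conversion is the `decode` method on byte strings. The example above in Python 3 syntax:

```
# The same four bytes, interpreted under two different encodings.
data = b'abc\xe4'
print(data.decode('iso-8859-1'))   # abcä
print(data.decode('iso-8859-5'))   # abcф
```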

Two functions convert a single-character string to and from its numeric value. The function "ord" works with both ordinary and Unicode strings, returning the numeric value of the character.

```>>> print ord(u'A')
65
>>> print ord(u'\u0444')
1092
```

The inverse function for ordinary strings is "chr", which returns a string of length one for a numeric value in the range 0-255. The newer function "unichr" returns a Unicode string of length one for a numeric value 0-65535.

```>>> chr(0xef)
'\xef'
>>> unichr(1092)
u'\u0444'
```

There is much more support built into Python for Unicode strings. The "string" module will split, join, strip, etc., Unicode strings just like ordinary strings. In fact, you can mix the two much as you might mix integers and longs in numeric expressions: Python first converts the ordinary string into a (more general) Unicode string and then performs the requested operation. Here is an example of splitting and joining strings.

```>>> import string
>>> a = u'abcd:efg'
>>> string.split(a, u':')
[u'abcd', u'efg']
>>> string.join(['abcd',u'efg'],':')
u'abcd:efg'
>>> a == 'abcd:efg'
1
```

This also works with other modules like "re" that deal with strings.
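In later versions of Python the `string` module's functions became methods on the strings themselves; the same splitting and joining in Python 3 syntax:

```
a = 'abcd:efg'
print(a.split(':'))                # ['abcd', 'efg']
print(':'.join(['abcd', 'efg']))   # abcd:efg
```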

Encoding Unicode strings

The inverse of the `unicode` function is the `encode` method on a Unicode string (Unicode strings are objects with methods). The result is an ordinary string in the given 8-bit encoding. Here is an example.

```>>> a = u'abcd\u0444'
>>> a.encode('iso-8859-5')
'abcd\xe4'
>>>
```
Remember that in Unicode 0x444 is the Russian character "ф", which corresponds to 0xe4 in 8-bit ISO-8859-5.
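The round trip is easy to check in Python 3 syntax, where `encode` lives on strings and `decode` on byte strings:

```
# encode and decode are inverses: U+0444 <-> byte 0xe4 in iso-8859-5.
s = 'abcd\u0444'
b = s.encode('iso-8859-5')
print(b)                            # b'abcd\xe4'
assert b.decode('iso-8859-5') == s
```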

New encodings for Unicode

Files on disk consist of byte streams, 8 bits each, and in order to handle Unicode in a file we need more than one byte per character. The two most common encodings are "UTF-16" and "UTF-8", each with its own advantages.

UTF-16 simply represents Unicode values in two bytes each. So our Unicode character "ф" = u"\u0444" becomes 0x04 0x44 in UTF-16, or possibly 0x44 0x04. The first is big endian and the second little endian format. Which one is in use is determined by the first two bytes of a UTF-16 string (or file), the byte order mark: 0xFE 0xFF for big endian or 0xFF 0xFE for little endian.
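We can verify the two byte orders and the byte order mark in Python 3 syntax:

```
ch = '\u0444'
assert ch.encode('utf-16-be') == b'\x04\x44'   # big endian: high byte first
assert ch.encode('utf-16-le') == b'\x44\x04'   # little endian: low byte first
bom = ch.encode('utf-16')[:2]                  # plain utf-16 starts with a BOM
assert bom in (b'\xff\xfe', b'\xfe\xff')
```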

UTF-16 is a good encoding for Japanese or Chinese, with their huge character sets, since 2 bytes per character is a good fit. But it's not so good for English or most other European languages: with English, alternate bytes are basically ASCII with zeros in between, so half the space is wasted.

UTF-8 deals with this very effectively. It encodes a Unicode character in one or more bytes, but only as many as are needed to represent the character's number. For normal ASCII characters that is a single byte, so UTF-8 strings and files containing only characters 00-7f are exactly the same as plain ASCII. Above that range UTF-8 uses byte values carefully chosen so that it can be determined exactly how many bytes to grab for the next character: if the first byte is 0x80 or above, at least one more byte will follow, and depending on the value of that first byte a third byte might follow, and so on. For example, the Korean character represented in Unicode as u'\uC5D0' is the 3-byte sequence 0xec 0x97 0x90 in UTF-8.
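Both cases can be checked directly (Python 3 syntax):

```
# ASCII stays one byte; the Korean character U+C5D0 becomes three bytes.
assert 'A'.encode('utf-8') == b'A'
assert '\uc5d0'.encode('utf-8') == b'\xec\x97\x90'
print(len('\uc5d0'.encode('utf-8')))   # 3
```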

So you can see that UTF-8 is more compact for European languages, where most characters require a single byte and two bytes are needed only for accents, umlauts, and the like. But if most or all of the text is Chinese, then UTF-16, at 2 bytes per character, will beat UTF-8 for compactness.

Examples of UTF-8 and UTF-16

Sometimes the best way to understand something is simply to see some examples. Let's start with a little piece of UTF-16 which contains both ASCII characters and some Japanese characters. With the aid of a little program, decode.py, we can look at the file byte by byte. The program replaces bytes with value 0 (about half of them) with a '.' and bytes of 0x80 and above with their hexadecimal value bracketed by '-'s. The result may be viewed without the HTML being interpreted by the browser. The English part is fairly readable.
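The original decode.py is not reproduced here, but a minimal sketch of such a byte dumper, following the behavior just described, might look like this (Python 3; the command-line handling is an assumption):

```
import sys

def dump(data):
    """Render a byte stream readably: '.' for zero bytes,
    -xx- for bytes 0x80 and above, the character itself otherwise."""
    out = []
    for b in data:
        if b == 0:
            out.append('.')
        elif b >= 0x80:
            out.append('-%02x-' % b)
        else:
            out.append(chr(b))
    return ''.join(out)

if __name__ == '__main__' and len(sys.argv) > 1:
    with open(sys.argv[1], 'rb') as f:
        print(dump(f.read()))
```

For instance, the UTF-16 bytes for "A" preceded by a little endian BOM, b'\xff\xfe\x41\x00', come out as '-ff--fe-A.'.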

Notice that the first two bytes are 0xFF and 0xFE, which specify the byte order of all the characters to come. The ASCII-compatible characters have their ASCII value in the first byte and zero in the following byte.

Notice that the meta tag contains 'content="text/html; charset=utf-16"'. This is picked up by the browser and controls how the page is interpreted. Without it the page does not display correctly.

Notice too that once we get into the Japanese characters themselves there are no zero bytes. The first pair is "S0" (the character zero, not the value), that is, the bytes 0x53 and 0x30, which in little endian order make the hexadecimal value 0x3053, the Japanese hiragana こ.

UTF-8 is really the more interesting of the two common Unicode encodings. Our little decode.py works fine for UTF-8 as well. Here is a sample in HTML and the same text decoded. Thanks to "Joel on Software", from whose pages I lifted little snippets. Check out his Unicode writeup here.

A little application

I enjoy foreign languages, but my keyboard is standard American. In order to type Russian or even German I would normally have to jump through some hoops. But here's some code to make it a bit easier.

In German the special characters ä, ö, ü and ß may be represented by the pairs ae, oe, ue and ss. This is an old trick that dates back to Morse code, I believe, and is how I generally type German. With a small table, german.py, and a Python program, utf8.py, any string with such pairs is transformed into a UTF-8 string with the pairs properly encoded.
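The pair tables themselves are simple. A sketch of what a german.py-style mapping and the substitution might look like, in Python 3 (the names `table` and `encode_german` are illustrative, not the original code):

```
# Hypothetical pair table in the spirit of german.py.
table = {'ae': 'ä', 'oe': 'ö', 'ue': 'ü', 'ss': 'ß'}

def encode_german(text):
    # Replace each letter pair and encode the result as UTF-8.
    # Real German text needs more care (e.g. the 'ue' in 'Frauen'
    # should stay), so this is only a naive sketch.
    for pair, char in table.items():
        text = text.replace(pair, char)
    return text.encode('utf-8')

print(encode_german('Gruesse'))   # b'Gr\xc3\xbc\xc3\x9fe', i.e. Grüße in UTF-8
```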

The program transforms stdin to stdout, looking for text between pairs of XML tags (<german>...</german>, <russian>...</russian> and <unicode>...</unicode>). In the case of german and russian any amount of text can be transformed; with the "unicode" tag only a single 4-digit hex Unicode value is encoded to UTF-8. In addition, the meta tag with 'charset=utf-8' is inserted into the header so your browser interprets the codes correctly.

The function `encode` imports `table` from the module passed or, if no module is passed, translates a single Unicode character into UTF-8. Besides German, the table russian.py lets me type Russian in semi-phonetic Latin characters like "Ya nye znal yego ochyen khorosho" (I didn't know him very well) and have it come out "Я не знал его очен кhорошо". (Actually, I'm missing a soft sound at the end of очен.) Incidentally, if you do a "view source" on this page you will see that it is UTF-8, and in fact I used this program to convert this writeup from its original HTML (generated from Lore) into what you are now viewing.
