This blog article [ 中文化和国际化问题权威解析之一:字符编码发展历程 ] ("Authoritative Analysis of Chinese Localization and Internationalization, Part 1: The History of Character Encoding") gives a clear and precise description of Unicode, with a brief history of character encoding. The article is written in Simplified Chinese.
Saturday, January 2, 2010
Unicode Related Articles
Friday, January 1, 2010
ASCII, Code Page and Unicode
This web page [ Microsoft Typography | Developer information | Character sets ] documents the history of character sets, from ASCII, OEM 8-bit characters, Windows ANSI and code pages, and DBCS, up to Unicode. At the end of the page, it also lists 12 steps to Unicode-enabling.
Unicode and UTF-8
What is the relationship between Unicode and UTF-8? The article "What Is UTF-8 And Why Is It Important?" is well worth reading. Still, I want to explain a little further.
Actually, "Unicode" most of the time refers to the mapping table that maps a number (called a code point) to a character. The number can be small, between 0 and 127; these code points map to the ordinary ASCII characters. For example, decimal 65 (hex 41) maps to the English letter A.
The number can be somewhat larger, needing 16 bits to represent; such code points map to, for example, commonly used Chinese characters. For example, decimal 20,013 (hex 4E2D) maps to the Chinese character 中.
The number can be larger still, needing more than 16 bits (up to 21 bits); such code points map to, for example, rarely used Chinese characters. For example, decimal 194,712 (hex 2F898) maps to the Chinese character 𦇚.
So far, I have only talked about the mapping and the size (16 bits or more) of the number. I have not yet mentioned how this number is stored.
At first glance, you could simply store each character as a fixed-width number wide enough for every code point, say 32 bits. But this wastes storage on unused ZEROs. For example, the English letter A is ASCII 65; its 1-byte representation is 0x41, but its 32-bit representation is 0x00000041, with three unused ZERO bytes at the beginning. Commonly used Chinese characters need only 2 bytes; in a 32-bit representation, the first 16 bits would likewise become unused ZEROs.
To tackle this problem, the UTF-8 encoding was developed. UTF-8 stores a Unicode code point in 1, 2 or 3 bytes, depending on the VALUE of the number. If the code point is a small number, i.e. an ASCII character, only 1 byte is used. The next 1,920 code points (covering, for example, most European letters) need 2 bytes. The code points of commonly used Chinese characters take 3 bytes.
Put another way, under UTF-8 the Unicode character table is divided into ranges. Characters in the ASCII range are stored in 1 byte, the next 1,920 characters in 2 bytes, and most commonly used Chinese characters in 3 bytes.
As an experiment, copy and paste the following line into PSPad:
abcdefghijαβγδ中文字
Then choose Format > UTF-8 and save the file.
The file size is 27 bytes: the 10 English letters take 10 x 1 = 10 bytes, the 4 Greek letters take 4 x 2 = 8 bytes, and the 3 Chinese characters take 3 x 3 = 9 bytes; 10 + 8 + 9 = 27 bytes. You can open the file using Firefox 3.0.8, which auto-detects UTF-8.
(By the way, you can also save it in UTF-16 LE encoding by choosing Format > UTF-16 LE. The file size becomes 36 bytes: the 17 characters take 17 x 2 = 34 bytes, plus the 2-byte BOM marker FF FE, for 2 + 34 = 36 bytes. Firefox also auto-detects UTF-16 LE.)
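The byte counts from the PSPad experiment can be reproduced without any editor, by encoding the same string in Python (the `"utf-16"` codec prepends the BOM automatically, matching the saved file):

```python
text = "abcdefghijαβγδ中文字"

utf8 = text.encode("utf-8")
print(len(utf8))    # 27 = 10x1 (ASCII) + 4x2 (Greek) + 3x3 (Chinese)

# Python's "utf-16" codec writes a 2-byte BOM followed by 2 bytes per character.
utf16 = text.encode("utf-16")
print(len(utf16))   # 36 = 2 (BOM) + 17x2
```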
Looking only at Chinese text, UTF-8 seems quite "expensive": it uses 3 bytes to store each commonly used Chinese character, even though those characters have 16-bit (2-byte) code points. In other words, there is an 8-bit (1-byte), i.e. 50%, overhead for each Chinese character. Yes, this is true. Therefore, for purely Chinese text, UTF-8 makes the file noticeably larger than UTF-16!
But UTF-8 has many advantages.
First of all, UTF-8 is backward compatible with ASCII; ASCII can be treated as a subset of UTF-8. All English characters are represented in 1 byte, exactly as in ASCII. Therefore, existing software that handles ASCII can also handle UTF-8.
Secondly, UTF-8 produces no stray ZERO bytes. In the C programming language, the null character (ZERO) is the string terminator; a ZERO byte in the middle of a stream would cut the string short and make many C programs misbehave. UTF-8 guarantees that no byte of an encoded character is ZERO (except for the NUL code point itself), so such programs keep working.
Also, UTF-8 makes character BOUNDARIES self-evident. If the code point is an ASCII character, it is stored as 0zzzzzzz (where zzzzzzz is the ASCII value, 0 to 127). A code point needing a 2-byte representation is stored as 110yyyzz 10zzzzzz (where yyyzzzzzzzz is the code point). A code point needing a 3-byte representation (e.g. a Chinese character) is stored as 1110yyyy 10yyyyzz 10zzzzzz. In other words, every 16-bit Unicode character is represented in UTF-8 as either (0-something), (110-something 10-something), or (1110-something 10-something 10-something). For more technical details, please refer to the UTF-8 page on Wikipedia.
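The 3-byte bit pattern above can be implemented by hand in a few lines and checked against a real UTF-8 encoder. The helper name `utf8_3byte` is made up for this sketch; it only covers code points in the 0x0800-0xFFFF range, matching the 1110yyyy 10yyyyzz 10zzzzzz layout:

```python
def utf8_3byte(cp):
    """Encode a code point in 0x0800..0xFFFF as 1110yyyy 10yyyyzz 10zzzzzz."""
    assert 0x0800 <= cp <= 0xFFFF
    return bytes([
        0b11100000 | (cp >> 12),          # lead byte: top 4 bits of the code point
        0b10000000 | ((cp >> 6) & 0x3F),  # continuation byte: middle 6 bits
        0b10000000 | (cp & 0x3F),         # continuation byte: low 6 bits
    ])

print(utf8_3byte(0x4E2D))       # b'\xe4\xb8\xad'
print("中".encode("utf-8"))     # the same 3 bytes
```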
Furthermore, in information transmission, UTF-8 helps detect errors. For example, if a new character begins with a 1110-something byte, 3 bytes are expected, so two 10-something bytes must follow; if they do not, something must be wrong. For another example, a 2-byte character must be (110-something 10-something); the sequence (110-something 0-something) is impossible, so seeing it means something must be wrong.
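This self-checking property is what a strict decoder relies on. For instance, a 1110-something lead byte followed by only one 10-something byte is rejected, as a quick Python check shows:

```python
# 0xE4 announces a 3-byte character, but only one continuation byte follows:
bad = b"\xe4\xb8"
try:
    bad.decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e.reason)
```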
Finally, although this article uses only 1-byte, 2-byte and 3-byte examples, UTF-8 extends to 4 bytes (the original design even allowed 5 and 6 bytes, though Unicode now caps code points at U+10FFFF, which needs at most 4 bytes). The 4-byte representation is 11110xxx 10xxyyyy 10yyyyzz 10zzzzzz, which can denote 21-bit code points.
Thursday, December 31, 2009
English Windows Systems with Chinese
This web page [ How to display and edit Chinese on English Windows systems ] is worth reading. It explains how to make an English Windows system work with both English and Chinese.
This web page [ Enter Chinese Characters under Windows XP - March 18, 2002 ] also has some information with some screen captures.
Well, the above articles tell you how to ENABLE your OS to input and display Chinese characters. Once the proper font is in use, Chinese characters display happily. However, what is the more in-depth knowledge behind this?
Actually, Windows XP uses the NTFS filesystem, which stores filenames in Unicode. In other words, all filename characters (whether Chinese, Japanese, Korean, English, Greek, etc.) are stored in the UTF-16 format. This is the underlying filesystem's filename data structure.
Then, how do Win32 applications interact with this UTF-16 Unicode filesystem? Unfortunately, there are 2 sets of APIs. Unicode-aware applications use the Unicode APIs and handle the UTF-16 filenames directly. For non-Unicode-aware applications, there is trouble.
When a non-Unicode application handles a filename as a string of double-byte characters, how is each double-byte sequence converted to a Unicode character? The non-Unicode API uses an MBCS codec. The MBCS codec, together with the locale information, converts each double-byte sequence to a particular Unicode character. Given the same double-byte sequence, changing the locale changes the resulting Unicode character.
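The locale dependence can be simulated with Python's code-page codecs (this is a sketch of the idea, not the actual Win32 conversion path): the same bytes decode to different characters under a Traditional Chinese code page (cp950, Big5) than under a Japanese one (cp932, Shift-JIS):

```python
# Bytes as a Traditional Chinese (Big5 / cp950) locale would store "中文".
raw = "中文".encode("cp950")
print(raw)

# Decoded with the matching locale, the original characters come back.
print(raw.decode("cp950"))

# Decoded under a Japanese locale (cp932), the same bytes produce
# different characters, or fail outright (hence errors="replace").
print(raw.decode("cp932", errors="replace"))
```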
Which default locale is your XP using? It is set in Control Panel -> Regional and Language Options -> Advanced -> Language for non-Unicode programs.
This creates another troublesome situation. For example, even if your system's default locale is Traditional Chinese, you can still create a file whose name contains Japanese characters using a Unicode-aware program. Then how does a non-Unicode-aware program handle this filename? The Japanese Unicode characters should map to double-byte sequences of the Japanese locale, but the default locale is Traditional Chinese, not Japanese. What then? Often such programs simply cannot handle the file, or display question marks in the filename.
For the details of the MBCS and the API, please refer to this web page [ All About Python and Unicode ].