Friday, January 1, 2010

Unicode and UTF-8

What is the relationship between Unicode and UTF-8? The article "What Is UTF-8 And Why Is It Important?" is a good read. Still, I want to explain a little further.

Most of the time, "Unicode" refers to the mapping table that maps a number (called a code point) to a character. The number can be small, between 0 and 127, in which case it maps to the ordinary ASCII characters. For example, decimal 65 (hex 41) maps to the English letter A.

The number can be larger and need up to 16 bits to represent. Such numbers map to, among others, the commonly used Chinese characters. For example, decimal 20,013 (hex 4E2D) maps to the Chinese character 中.

If the number is larger still (more than 16 bits; Unicode code points go up to hex 10FFFF, which needs 21 bits), it maps to, among others, rarely used Chinese characters. For example, decimal 194,712 (hex 2F898) maps to the Chinese character 𦇚.
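Here is a small Python sketch of this mapping for the three example characters. The helper name describe is my own; ord and chr are Python built-ins that convert between a character and its code point.

```python
def describe(ch: str) -> str:
    """Return the code point of a single character in hex and decimal."""
    cp = ord(ch)                      # character -> code point
    assert chr(cp) == ch              # code point -> character
    return f"U+{cp:04X} (decimal {cp:,})"

for ch in ["A", "中", "\U0002F898"]:
    print(ch, describe(ch))
# A U+0041 (decimal 65)
# 中 U+4E2D (decimal 20,013)
# 𦇚 U+2F898 (decimal 194,712)
```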

So far, I have only talked about the mapping and the size of the number (up to 16 bits, or more than 16 bits). I have not mentioned how to store this number.

At first glance, you could simply store every character as a fixed-size number, for example 32 bits. But this wastes storage because there are many unused ZEROs. For example, the English letter A is ASCII 65. Its 1-byte representation is 0x41. Its 32-bit representation is 0x00000041, with many unused ZEROs at the beginning. The commonly used Chinese characters need only 2 bytes. In a 32-bit representation, their first 16 bits are also unused ZEROs.
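To see those leading ZEROs, here is a minimal Python sketch that stores each character as a fixed 32-bit number. I am using Python's "utf-32-be" codec (big-endian, no BOM) as the fixed-size storage, and the helper name utf32_hex is my own.

```python
def utf32_hex(ch: str) -> str:
    """Hex dump of a character stored as a fixed 32-bit big-endian number."""
    return ch.encode("utf-32-be").hex()

print(utf32_hex("A"))   # 00000041
print(utf32_hex("中"))  # 00004e2d
```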

To tackle this problem, the UTF-8 method was developed. It stores the Unicode code point in 1, 2, 3 or 4 bytes, depending on the VALUE of the number. If the code point is small, i.e. one of the ASCII code points, only 1 byte is used. The next 1,920 code points (hex 80 to 7FF, mostly letters of European and Middle Eastern alphabets) need 2 bytes. The code points of the commonly used Chinese characters need 3 bytes. Code points above hex FFFF need 4 bytes.

You can say that, under UTF-8, the Unicode character table is divided into several areas. Characters in the ASCII area take 1 byte to store. The 1,920 characters in the next area take 2 bytes. Most commonly used Chinese characters take 3 bytes.
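The areas can be written down as a short Python sketch. The function name utf8_length is my own, and the boundaries (hex 80, 800 and 10000) come from the UTF-8 definition; the fourth area, above hex FFFF, takes 4 bytes.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for a code point, chosen by its value."""
    if code_point < 0x80:       # ASCII area
        return 1
    if code_point < 0x800:      # next 0x800 - 0x80 = 1,920 code points
        return 2
    if code_point < 0x10000:    # includes the commonly used Chinese characters
        return 3
    return 4                    # everything above U+FFFF

for ch in ["A", "α", "中", "\U0002F898"]:
    # Cross-check against Python's own UTF-8 encoder.
    print(ch, utf8_length(ord(ch)), len(ch.encode("utf-8")))
```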

As an experiment, copy and paste the following line into PSPad:

abcdefghijαβγδ中文字

Then, choose Format > UTF-8 and then save the file.

The file size is 27 bytes. The 10 English letters consume 10 x 1 = 10 bytes. The 4 Greek letters consume 4 x 2 = 8 bytes. The 3 Chinese characters consume 3 x 3 = 9 bytes. 10 + 8 + 9 = 27 bytes. You can open the file with Firefox 3.0.8, which auto-detects UTF-8.

(By the way, you can save it in UTF-16 LE encoding by choosing Format > UTF-16 LE. The file size will be 36 bytes. The 17 characters consume 17 x 2 = 34 bytes. Adding the 2-byte BOM marker (bytes FF FE) gives 2 + 34 = 36 bytes. Firefox can also auto-detect UTF-16 LE when opening the file.)
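If you do not have PSPad at hand, the same experiment can be sketched in Python. This only reproduces the byte counts; it assumes the editor writes no BOM for UTF-8 and the 2-byte BOM for UTF-16 LE, which is what the file sizes above imply.

```python
import codecs

text = "abcdefghijαβγδ中文字"

utf8_bytes = text.encode("utf-8")
# "utf-16-le" writes no BOM by itself, so prepend it as the editor does.
utf16_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(len(text))         # 17 characters
print(len(utf8_bytes))   # 27
print(len(utf16_bytes))  # 36
```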

If you look only at Chinese characters, UTF-8 seems quite "expensive" because it uses 3 bytes to store each commonly used Chinese character, while the code point of such a character fits in 16 bits (2 bytes). In other words, there is an 8-bit (1-byte) overhead, i.e. 50%, for each Chinese character. Yes, this is the truth. Therefore, if the text is purely Chinese, saving it as UTF-8 makes the file about 50% larger than saving it as UTF-16!

But, UTF-8 has many advantages.

First of all, UTF-8 is backward compatible with ASCII. ASCII can be treated as a subset of UTF-8: every ASCII character is stored as 1 byte with the same value as in ASCII. Therefore, any ASCII file is already a valid UTF-8 file, and much existing software that handles ASCII text can pass UTF-8 text through unchanged.

Secondly, there are no unused ZEROs in UTF-8. In the C programming language, the null character ZERO is the string terminator. If there is such a ZERO byte in the stream, it terminates the string, so unused ZEROs make many C programs behave unexpectedly. The UTF-8 encoding method ensures that a ZERO byte appears only for the code point 0 itself, so many C programs can run as usual.
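A quick way to check this is to look for zero bytes in the encoded data. The Python sketch below compares UTF-16 LE and UTF-8 for a two-character string of my own choosing.

```python
text = "A中"

utf16 = text.encode("utf-16-le")  # bytes 41 00 2d 4e
utf8 = text.encode("utf-8")       # bytes 41 e4 b8 ad

print(0 in utf16)  # True: the letter A is padded with a zero byte
print(0 in utf8)   # False: no zero byte unless the text contains code point 0
```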

Also, UTF-8 defines the character BOUNDARY. In UTF-8, if the code point is an ASCII character, it is stored as 0zzzzzzz (where zzzzzzz is the ASCII value, ranging from 0 to 127). A code point needing a 2-byte representation is stored as 110yyyzz 10zzzzzz (where yyyzzzzzzzz is the code point). A code point needing a 3-byte representation (e.g. a Chinese character) is stored as 1110yyyy 10yyyyzz 10zzzzzz. In other words, every 16-bit Unicode character is either a (0-something) or a (110-something 10-something) or a (1110-something 10-something 10-something) representation in UTF-8. A byte beginning with 10 is never the first byte of a character, so a program can find where each character starts from any position in the stream. For more technical details, please refer to the UTF-8 page in Wikipedia.
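The three bit patterns can be applied by hand with a few shifts and masks. This is only a sketch for code points below hex 10000; the function name encode_utf8_bmp is my own, and it does not reject the surrogate range (hex D800 to DFFF) as a real encoder would.

```python
def encode_utf8_bmp(code_point: int) -> bytes:
    """Encode a code point below 0x10000 with the 1-, 2- or 3-byte pattern."""
    if code_point < 0x80:
        return bytes([code_point])                            # 0zzzzzzz
    if code_point < 0x800:
        return bytes([0b11000000 | (code_point >> 6),         # 110yyyzz
                      0b10000000 | (code_point & 0b111111)])  # 10zzzzzz
    if code_point < 0x10000:
        return bytes([0b11100000 | (code_point >> 12),                 # 1110yyyy
                      0b10000000 | ((code_point >> 6) & 0b111111),     # 10yyyyzz
                      0b10000000 | (code_point & 0b111111)])           # 10zzzzzz
    raise ValueError("this sketch only covers code points below 0x10000")

def bits(data: bytes) -> str:
    """Show each byte as 8 binary digits."""
    return " ".join(f"{b:08b}" for b in data)

print(bits(encode_utf8_bmp(ord("A"))))   # 01000001
print(bits(encode_utf8_bmp(ord("α"))))   # 11001110 10110001
print(bits(encode_utf8_bmp(ord("中"))))  # 11100100 10111000 10101101
```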

Furthermore, in information transmission, UTF-8 helps detect errors. For example, if a new character begins with 1110-something, 3 bytes are expected, so two 10-something bytes should follow. If you cannot find 2 more 10-something bytes, something must be wrong. As another example, a 2-byte representation is (110-something 10-something). The sequence (110-something 0-something) is impossible, so if it appears, something must be wrong.
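The check described above can be sketched in Python as follows. The function name is_well_formed is my own, and it only checks the lead-byte and 10-something structure; a full validator would also reject a few other cases (such as overlong forms) that I do not cover in this article.

```python
def is_well_formed(data: bytes) -> bool:
    """Check that every lead byte is followed by the expected 10-something bytes."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                # 0-something: 1 byte, nothing follows
            extra = 0
        elif lead >> 5 == 0b110:       # 110-something: 1 more byte expected
            extra = 1
        elif lead >> 4 == 0b1110:      # 1110-something: 2 more bytes expected
            extra = 2
        elif lead >> 3 == 0b11110:     # 11110-something: 3 more bytes expected
            extra = 3
        else:
            return False               # a 10-something byte cannot start a character
        tail = data[i + 1 : i + 1 + extra]
        if len(tail) != extra or any(b >> 6 != 0b10 for b in tail):
            return False               # missing or wrong continuation byte
        i += 1 + extra
    return True

print(is_well_formed("中文".encode("utf-8")))  # True
print(is_well_formed(b"\xe4\xb8"))            # False: 1110-something with only 1 more byte
print(is_well_formed(b"\xc3\x41"))            # False: 110-something followed by 0-something
```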

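Finally, although this article mostly uses 1-byte, 2-byte and 3-byte representations as examples, UTF-8 also has a 4-byte representation: 11110xxx 10xxyyyy 10yyyyzz 10zzzzzz. This 4-byte representation can hold a 21-bit number, which covers every Unicode code point, including the rarely used Chinese character 𦇚 (hex 2F898) mentioned earlier. The original design extended the same pattern to 5 and 6 bytes, but the current standard (RFC 3629) stops at 4 bytes because Unicode code points end at hex 10FFFF.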
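Here is the 4-byte pattern applied to hex 2F898 as a Python sketch. The function name encode_utf8_4byte is my own, and the result is compared with Python's built-in encoder.

```python
def encode_utf8_4byte(code_point: int) -> bytes:
    """Encode a code point from 0x10000 to 0x10FFFF with the 4-byte pattern."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("the 4-byte pattern is used for 0x10000 to 0x10FFFF")
    return bytes([0b11110000 | (code_point >> 18),                # 11110xxx
                  0b10000000 | ((code_point >> 12) & 0b111111),   # 10xxyyyy
                  0b10000000 | ((code_point >> 6) & 0b111111),    # 10yyyyzz
                  0b10000000 | (code_point & 0b111111)])          # 10zzzzzz

encoded = encode_utf8_4byte(0x2F898)
print(encoded.hex())                                 # f0afa298
print(encoded == "\U0002F898".encode("utf-8"))       # True
```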

