Difference between UTF-8, UTF-16 and UTF-32 Encoding

UTF-32

In UTF-32, every character is encoded in 32 bits (four bytes). The main advantage of UTF-32 is that Unicode code points are directly indexable: finding the Nth code point in a sequence of code points is a constant-time operation. For the same reason it is easy to calculate the length of the string, since the number of code points is just the byte length divided by four. The disadvantage is that each ASCII character wastes three extra bytes.
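The constant-time indexing described above can be sketched in a few lines of Python. This is only an illustration: the helper names `code_point_count` and `nth_code_point` are made up for this example, and it assumes little-endian UTF-32 bytes with no byte order mark.

```python
def code_point_count(utf32_bytes: bytes) -> int:
    """Number of code points: every code point is exactly 4 bytes."""
    return len(utf32_bytes) // 4


def nth_code_point(utf32_bytes: bytes, n: int) -> str:
    """Return the n-th code point (0-based) without scanning the string."""
    if not 0 <= n < code_point_count(utf32_bytes):
        raise IndexError("code point index out of range")
    unit = utf32_bytes[4 * n : 4 * n + 4]  # jump straight to the right offset
    return chr(int.from_bytes(unit, "little"))


sample = "a€語😀"
encoded = sample.encode("utf-32-le")  # assumption: little-endian, no BOM

print(code_point_count(encoded))   # 4
print(nth_code_point(encoded, 2))  # 語

# The cost: plain ASCII text takes four times the space it needs in UTF-8.
print(len("hello".encode("utf-32-le")), len("hello".encode("utf-8")))  # 20 5
```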

UTF-8

In UTF-8, characters have variable length. ASCII characters are encoded in one byte (eight bits), most Western special characters are encoded in either two or three bytes (for example, € is three bytes), and more exotic characters can take up to four bytes. The clear disadvantage is that you cannot know a string’s length in characters a priori; you have to scan its bytes. But it takes far fewer bytes to encode Latin (English) alphabet text than UTF-32 does.
ASCII characters (U+0000 to U+007F) take 1 byte, code points U+0080 to U+07FF take 2 bytes, code points U+0800 to U+FFFF take 3 bytes, and code points U+10000 to U+10FFFF take 4 bytes. Good for English text, not so good for Asian text, where most characters take 3 bytes.
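The ranges above translate directly into a small function. The following is a rough sketch (the name `utf8_length` is hypothetical, not a library function) that is checked against Python’s own encoder for a few sample characters.

```python
def utf8_length(code_point: int) -> int:
    """Bytes needed to encode one code point in UTF-8, by range."""
    if code_point < 0x80:        # U+0000 to U+007F (ASCII)
        return 1
    if code_point < 0x800:       # U+0080 to U+07FF
        return 2
    if code_point < 0x10000:     # U+0800 to U+FFFF
        return 3
    if code_point <= 0x10FFFF:   # U+10000 to U+10FFFF
        return 4
    raise ValueError("not a Unicode code point")


for ch in ["A", "é", "€", "語", "😀"]:
    # Compare the range rule with what the built-in encoder produces.
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {utf8_length(ord(ch))} byte(s)")
```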

UTF-16

UTF-16 is also variable length. Characters are encoded in either two bytes or four bytes. It has the disadvantage of being variable length, but it doesn’t have the advantage of saving as much space as UTF-8 does for ASCII text. Code points U+0000 to U+FFFF take 2 bytes, and code points U+10000 to U+10FFFF take 4 bytes (a pair of 16-bit units known as a surrogate pair). Bad for English text, good for Asian text.
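To make the two-byte versus four-byte split concrete, here is a minimal sketch of how a code point maps to UTF-16 code units. The function name `utf16_units` is invented for this example; the surrogate arithmetic is the standard one, and the result is cross-checked against Python’s big-endian UTF-16 encoder.

```python
def utf16_units(code_point: int) -> list[int]:
    """Return the 16-bit code units UTF-16 uses for one code point."""
    if code_point < 0x10000:
        return [code_point]               # one unit, 2 bytes
    offset = code_point - 0x10000         # 20 bits left to store
    high = 0xD800 + (offset >> 10)        # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits -> low surrogate
    return [high, low]                    # two units, 4 bytes


for ch in ["A", "語", "😀"]:
    units = utf16_units(ord(ch))
    # Cross-check against the built-in encoder (big-endian, no BOM).
    expected = b"".join(u.to_bytes(2, "big") for u in units)
    assert expected == ch.encode("utf-16-be")
    print(f"U+{ord(ch):04X} ->", " ".join(f"{u:04X}" for u in units))
```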

Examples

A small example where UTF-16 is actually better than UTF-8:

Consider the Chinese character “語” (U+8A9E). Its UTF-8 encoding takes three bytes:

11101000 10101010 10011110

While its UTF-16 encoding is shorter:

10001010 10011110
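The two bit patterns above can be reproduced with a short Python check. The helper `to_bits` is just a name chosen for this sketch, and the UTF-16 bytes are taken in big-endian order without a byte order mark, which is what the pattern shown corresponds to.

```python
def to_bits(data: bytes) -> str:
    """Render each byte as eight binary digits, separated by spaces."""
    return " ".join(f"{byte:08b}" for byte in data)


ch = "語"
print(to_bits(ch.encode("utf-8")))      # 11101000 10101010 10011110
print(to_bits(ch.encode("utf-16-be")))  # 10001010 10011110
```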

[Image: hexadecimal representation of the string ‘Hello World’ in each encoding]

[Image: the same for the Chinese equivalent of ‘Hello World’, ‘你好世界’]
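The hexadecimal representations of both strings can be generated with a few lines of Python. This is a sketch rather than a reproduction of the original figures: the helper `hex_dump` is hypothetical, and big-endian byte order without a byte order mark is assumed for UTF-16 and UTF-32.

```python
ENCODINGS = ["utf-8", "utf-16-be", "utf-32-be"]  # assumption: big-endian, no BOM


def hex_dump(text: str, encoding: str) -> str:
    """Encode the text and show every byte as two hex digits."""
    return " ".join(f"{byte:02x}" for byte in text.encode(encoding))


for text in ["Hello World", "你好世界"]:
    print(text)
    for encoding in ENCODINGS:
        size = len(text.encode(encoding))
        print(f"  {encoding:<10} {size:>2} bytes  {hex_dump(text, encoding)}")
```

Running it shows the trade-off from the sections above: ‘Hello World’ is smallest in UTF-8 (11 bytes against 22 and 44), while ‘你好世界’ is smallest in UTF-16 (8 bytes against 12 in UTF-8 and 16 in UTF-32).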


Of these three, UTF-8 is clearly the most widely used.

Uday Ogra

Connect with me at http://facebook.com/tendulkarogra and let’s have some healthy discussion :)
