How are characters represented in UTF-8 encoding

UTF-8 is a character encoding capable of encoding all possible characters, or code points, defined by Unicode. UTF-8 encodes each of the 1,112,064 valid code points in Unicode using one to four 8-bit bytes.

To learn more about UTF-8 encoding, read Difference between UTF-8, UTF-16 and UTF-32 Encoding.

It is a variable-length encoding: the number of bytes used to represent a character depends on the code point of the character. In UTF-8 the high-order bits are very important. The leading high-order bits of the first byte tell how many bytes have been used to encode the value, and every following (continuation) byte starts with 10. Let's see what each byte in a multi-byte representation looks like.
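To make the role of the leading bits concrete, here is a minimal Python sketch that reads the length of a sequence from its first byte. The function name `sequence_length` is made up for this illustration, and it only looks at the prefix bits; a strict decoder would also reject a few first bytes (such as C0, C1 and F5 to FF) that never occur in valid UTF-8.

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judging by its first byte."""
    if first_byte >> 7 == 0b0:        # 0xxxxxxx
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte, 11111xxx is not used at all
    raise ValueError("not a valid leading byte")


print(sequence_length("A".encode("utf-8")[0]))           # 1
print(sequence_length("\u00e9".encode("utf-8")[0]))      # 2  (U+00E9)
print(sequence_length("\u20ac".encode("utf-8")[0]))      # 3  (U+20AC)
print(sequence_length(chr(0x1F600).encode("utf-8")[0]))  # 4  (U+1F600)
```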

1. One Byte Character

ASCII characters (U+0000 to U+007F) take 1 byte. The first bit of the byte has to be 0, so the byte looks like 0xxxxxxx.

This gives 7 bits to encode the actual character.
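A quick way to see this in Python (a small sketch, with the sample string chosen arbitrarily): every byte of an ASCII-only string has its first bit set to 0, and the UTF-8 bytes are identical to the plain ASCII bytes.

```python
text = "Hello"
encoded = text.encode("utf-8")

# Every byte is below 0x80, i.e. its first bit is 0
all_single_byte = all(b < 0x80 for b in encoded)

# For ASCII-only text, UTF-8 and ASCII produce the same bytes
same_as_ascii = encoded == text.encode("ascii")

print(all_single_byte, same_as_ascii)  # True True
print(format(encoded[0], "08b"))       # 01001000  ('H' is U+0048)
```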

2. Two Bytes Character

Code points U+0080 to U+07FF take 2 bytes. The first 3 bits of the first byte have to be 110 and the first 2 bits of the second byte have to be 10, so the layout is 110xxxxx 10xxxxxx.

This gives 11 bits to encode the actual character.
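The two-byte layout can be sketched with a couple of bit operations. The helper `encode_two_bytes` below is a hypothetical name used only for this example; it is checked against Python's built-in encoder for U+00E9.

```python
def encode_two_bytes(code_point: int) -> bytes:
    """Pack an 11-bit code point into the 110xxxxx 10xxxxxx layout."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the 2-byte range")
    first = 0b11000000 | (code_point >> 6)          # 110 + top 5 bits
    second = 0b10000000 | (code_point & 0b111111)   # 10 + low 6 bits
    return bytes([first, second])


encoded = encode_two_bytes(0x00E9)
print(encoded.hex())                          # c3a9
print(encoded == "\u00e9".encode("utf-8"))    # True
```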

3. Three Bytes Character

Code points U+0800 to U+FFFF take 3 bytes. The first 4 bits of the first byte are 1110, and the first 2 bits of the second and third bytes have to be 10, so the layout is 1110xxxx 10xxxxxx 10xxxxxx.

This gives 16 bits to encode the actual character.
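To see the three-byte prefixes in practice, this sketch prints the bytes Python produces for the euro sign (U+20AC, picked as an arbitrary example), then strips the masking bits to recover the 16 payload bits.

```python
encoded = "\u20ac".encode("utf-8")  # euro sign, U+20AC

bits = [format(b, "08b") for b in encoded]
print(bits)  # ['11100010', '10000010', '10101100']

# Drop the 1110 prefix of the first byte and the 10 prefix of the other two
payload = bits[0][4:] + bits[1][2:] + bits[2][2:]
print(payload)                # 0010000010101100
print(hex(int(payload, 2)))   # 0x20ac
```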

4. Four Bytes Character

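Code points U+10000 to U+10FFFF take 4 bytes. The first 5 bits of the first byte are 11110, and the first 2 bits of the second, third and fourth bytes have to be 10, so the layout is 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.

This gives 21 bits to encode the actual character.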
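The four ranges above can be summed up in one small function. This is only a sketch: `utf8_length` is a name invented here, and it does not treat the surrogate range U+D800 to U+DFFF specially, although those code points cannot be encoded in valid UTF-8.

```python
def utf8_length(code_point: int) -> int:
    """Return the number of UTF-8 bytes needed for a code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


# Compare with what Python's encoder actually produces
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(hex(cp), utf8_length(cp), len(chr(cp).encode("utf-8")))
```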

 

Example

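Let's convert the hexadecimal value 001FACBD to its UTF-8 representation. As this number is greater than FFFF, it will require 4 bytes to encode. (Strictly speaking, this value lies above U+10FFFF, the highest valid Unicode code point, so it is not a real character. It still fits in 21 bits, so it works as an illustration of the 4-byte layout.) Here is the normal 32-bit representation of this value:

00000000 00011111 10101100 10111101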

Now we will encode it using UTF-8. This is the structure into which we have to fit the original bits:

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

We will start from the right: the 6 rightmost bits of the original value go into the 6 non-masking bits of the 4th byte of the UTF-8 structure. Similarly, the next 6 go to the 3rd byte and the next 6 to the 2nd. The last 3 bits go to the first byte. The result is 11110111 10111010 10110010 10111101, or F7 BA B2 BD in hexadecimal.

So in total 3+6+6+6 = 21 bits are used for the value out of the total 32 bits; the remaining 11 bits are masking bits.
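The same steps can be written out as a short Python sketch. The function `encode_four_bytes` is a hypothetical helper for this walkthrough, not a library call. It packs the bits by hand because 001FACBD is larger than 10FFFF, so Python's own `chr()` would refuse it; the layout itself has room for any value up to 1FFFFF.

```python
def encode_four_bytes(value: int) -> bytes:
    """Pack a 21-bit value into the 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx layout."""
    if not 0x10000 <= value <= 0x1FFFFF:
        raise ValueError("value does not fit the 4-byte layout")
    b4 = 0b10000000 | (value & 0b111111)           # rightmost 6 bits
    b3 = 0b10000000 | ((value >> 6) & 0b111111)    # next 6 bits
    b2 = 0b10000000 | ((value >> 12) & 0b111111)   # next 6 bits
    b1 = 0b11110000 | (value >> 18)                # remaining 3 bits
    return bytes([b1, b2, b3, b4])


encoded = encode_four_bytes(0x001FACBD)
print(" ".join(format(b, "08b") for b in encoded))
# 11110111 10111010 10110010 10111101
print(encoded.hex())  # f7bab2bd
```

For a valid code point such as U+1F600, the same function gives the same bytes as `chr(0x1F600).encode("utf-8")`.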

Similarly, if you have UTF-8 bytes, you can convert them back into the code point and character by reversing this process.
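Here is a minimal sketch of that reverse direction for the 4-byte case, assuming the input is exactly four bytes long. The name `decode_four_bytes` is invented for this example; it checks the masking bits, strips them, and glues the remaining 3+6+6+6 bits back together.

```python
def decode_four_bytes(data: bytes) -> int:
    """Recover the 21-bit value from a 4-byte UTF-8 style sequence."""
    b1, b2, b3, b4 = data  # raises ValueError if there are not exactly 4 bytes
    if b1 >> 3 != 0b11110:
        raise ValueError("first byte does not start with 11110")
    if any(b >> 6 != 0b10 for b in (b2, b3, b4)):
        raise ValueError("continuation bytes must start with 10")
    return (
        ((b1 & 0b111) << 18)
        | ((b2 & 0b111111) << 12)
        | ((b3 & 0b111111) << 6)
        | (b4 & 0b111111)
    )


print(hex(decode_four_bytes(bytes.fromhex("f7bab2bd"))))  # 0x1facbd

# Round trip with a real character, U+1F600
value = decode_four_bytes(chr(0x1F600).encode("utf-8"))
print(hex(value))  # 0x1f600
```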

 

Now that you know how UTF-8 works, I am leaving a question for you. Best of luck 🙂

Uday Ogra

Connect with me at http://facebook.com/tendulkarogra and let's have some healthy discussion :)
