How are characters represented in UTF-8 encoding
UTF-8 is a character encoding capable of encoding all possible characters, or code points, defined by Unicode. UTF-8 encodes each of the 1,112,064 valid code points in Unicode using one to four 8-bit bytes.
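The figure of 1,112,064 can be checked with a little arithmetic. The sketch below assumes the Unicode range U+0000 to U+10FFFF and subtracts the surrogate range U+D800 to U+DFFF. The surrogate range is not mentioned above, but the count is based on excluding it. The constant names are made up for illustration.

```python
# Total code points in the Unicode range U+0000..U+10FFFF.
TOTAL_CODE_POINTS = 0x10FFFF + 1

# Surrogate code points U+D800..U+DFFF are reserved for UTF-16
# and are not counted as valid code points for UTF-8.
SURROGATES = 0xDFFF - 0xD800 + 1

VALID_CODE_POINTS = TOTAL_CODE_POINTS - SURROGATES
print(VALID_CODE_POINTS)  # 1112064
```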
To learn more about UTF-8, read "Difference between UTF-8, UTF-16 and UTF-32 Encoding".
UTF-8 is a variable-length encoding: the number of bytes used to represent a character depends on the character's code point. In UTF-8 the high-order bits are very important. The leading high-order bits of the first byte tell how many bytes have been used to encode the value. Let's see what each byte in a multi-byte representation looks like.
1. One Byte Character
ASCII characters (U+0000 to U+007F) take 1 byte. The first bit of the byte has to be 0, so the byte looks like 0xxxxxxx.
This gives 7 bits to encode the actual character.
2. Two Bytes Character
Code points U+0080 to U+07FF take 2 bytes. The first 3 bits of the first byte have to be 110 and the first 2 bits of the second byte have to be 10, so the bytes look like 110xxxxx 10xxxxxx.
This gives 11 bits (5 + 6) to encode the actual character.
3. Three Bytes Character
Code points U+0800 to U+FFFF take 3 bytes. The first 4 bits of the first byte are 1110, and the first 2 bits of the second and third bytes have to be 10, so the bytes look like 1110xxxx 10xxxxxx 10xxxxxx.
This gives 16 bits (4 + 6 + 6) to encode the actual character.
4. Four Bytes Character
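Code points U+10000 to U+10FFFF take 4 bytes. The first 5 bits of the first byte are 11110, and the first 2 bits of the second, third and fourth bytes have to be 10, so the bytes look like 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
This gives 21 bits (3 + 6 + 6 + 6) to encode the actual character.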
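The four cases above can be sketched in Python. This is a minimal illustration, not a full validator: `bytes_needed` and `sequence_length` are hypothetical helper names, the first maps a code point to its byte count using the ranges listed above, and the second reads the byte count back from the leading bits of the first byte.

```python
def bytes_needed(code_point: int) -> int:
    """Return how many UTF-8 bytes a code point takes, per the ranges above."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point is outside U+0000..U+10FFFF")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


def sequence_length(first_byte: int) -> int:
    """Return the length of a UTF-8 sequence, judged from its first byte."""
    if (first_byte & 0b10000000) == 0b00000000:  # 0xxxxxxx
        return 1
    if (first_byte & 0b11100000) == 0b11000000:  # 110xxxxx
        return 2
    if (first_byte & 0b11110000) == 0b11100000:  # 1110xxxx
        return 3
    if (first_byte & 0b11111000) == 0b11110000:  # 11110xxx
        return 4
    # Covers 10xxxxxx (a continuation byte) and 11111xxx.
    raise ValueError(f"{first_byte:#04x} is not a valid leading byte")


# Compare both helpers against Python's built-in encoder.
for ch in ("A", "é", "€", "😀"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", bytes_needed(ord(ch)), sequence_length(encoded[0]))
```

Both helpers should agree with `len(ch.encode("utf-8"))` for any character you try.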
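Let's convert the hexadecimal value 001FACBD to its UTF-8 representation. As this number is greater than FFFF, it requires 4 bytes to encode. (It is also greater than 10FFFF, so it is not a valid Unicode code point; it is used here only to illustrate how the bits are laid out.) Here is the normal byte representation of this value:
00000000 00011111 10101100 10111101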
Now we will encode it using UTF-8. This is the structure into which we have to fit the original bits:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
We start from the right: the 6 rightmost bits of the original value go into the 6 non-masking bits of the 4th byte of the UTF-8 structure. Similarly, the next 6 bits go into the 3rd byte and the next 6 into the 2nd. The last 3 bits go into the first byte:
11110111 10111010 10110010 10111101 (F7 BA B2 BD in hexadecimal)
So in total 3 + 6 + 6 + 6 = 21 bits carry the value out of the 32 bits, as the remaining 11 bits are masking bits.
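The bit shuffling for this example can be reproduced with shifts and masks. The sketch below is one possible way to do it, with `encode_four_bytes` as a hypothetical helper name. It packs the bits by hand because 001FACBD lies above U+10FFFF, so Python's built-in `str.encode` cannot be used for this particular value.

```python
def encode_four_bytes(value: int) -> bytes:
    """Pack a 21-bit value into the 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx layout."""
    if not 0x10000 <= value <= 0x1FFFFF:
        raise ValueError("value does not fit the four-byte layout")
    byte1 = 0b11110000 | (value >> 18)           # top 3 bits
    byte2 = 0b10000000 | ((value >> 12) & 0x3F)  # next 6 bits
    byte3 = 0b10000000 | ((value >> 6) & 0x3F)   # next 6 bits
    byte4 = 0b10000000 | (value & 0x3F)          # lowest 6 bits
    return bytes([byte1, byte2, byte3, byte4])


encoded = encode_four_bytes(0x1FACBD)
print(" ".join(f"{b:08b}" for b in encoded))  # 11110111 10111010 10110010 10111101
print(" ".join(f"{b:02X}" for b in encoded))  # F7 BA B2 BD
```

For a valid four-byte code point such as U+1F600, the same function should return the same bytes as `"😀".encode("utf-8")`.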
Similarly, if you have UTF-8 bytes, you can convert them back into the code point and character by reversing this process.
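A minimal sketch of the reverse direction is shown below. `decode_utf8_sequence` is a hypothetical helper name, and the function assumes it is given exactly one complete sequence. It strips the masking bits and joins the remaining bits, but it does not reject overlong encodings or values above U+10FFFF, which a production decoder would.

```python
def decode_utf8_sequence(data: bytes) -> int:
    """Return the value stored in a single UTF-8 byte sequence."""
    if not data:
        raise ValueError("empty input")
    first = data[0]
    # The leading bits of the first byte give the length and the payload mask.
    if (first & 0x80) == 0x00:    # 0xxxxxxx
        length, value = 1, first & 0x7F
    elif (first & 0xE0) == 0xC0:  # 110xxxxx
        length, value = 2, first & 0x1F
    elif (first & 0xF0) == 0xE0:  # 1110xxxx
        length, value = 3, first & 0x0F
    elif (first & 0xF8) == 0xF0:  # 11110xxx
        length, value = 4, first & 0x07
    else:
        raise ValueError("invalid leading byte")
    if len(data) != length:
        raise ValueError("unexpected number of bytes")
    for byte in data[1:]:
        if (byte & 0xC0) != 0x80:  # every following byte must be 10xxxxxx
            raise ValueError("continuation byte must start with 10")
        value = (value << 6) | (byte & 0x3F)  # append the 6 payload bits
    return value


print(hex(decode_utf8_sequence(bytes([0xF7, 0xBA, 0xB2, 0xBD]))))  # 0x1facbd
print(chr(decode_utf8_sequence("€".encode("utf-8"))))              # €
```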
Now that you know how UTF-8 works, I am leaving a question for you. Best of luck 🙂