How are characters represented in UTF-16 format

UTF-16 is a variable-length encoding that uses either 2 bytes or 4 bytes to represent a character. Compared with UTF-32, which always uses 4 bytes per character, files made up mostly of 2-byte characters come out at around half the size.
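To make the 2-byte versus 4-byte distinction concrete, here is a small Python sketch. The helper names `utf16_size` and `utf32_size` are made up for this illustration, and the three sample characters are just examples: two that fit in a single 2-byte unit and one (an emoji, U+1F600) that needs 4 bytes.

```python
def utf16_size(ch: str) -> int:
    """Bytes needed for one character in UTF-16 (no BOM included)."""
    return len(ch.encode("utf-16-be"))


def utf32_size(ch: str) -> int:
    """Bytes needed for one character in UTF-32 (no BOM included)."""
    return len(ch.encode("utf-32-be"))


# 'A' (U+0041), the euro sign (U+20AC) and an emoji (U+1F600)
for ch in ["A", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}: UTF-16 = {utf16_size(ch)} bytes, "
          f"UTF-32 = {utf32_size(ch)} bytes")
```

The first two characters take 2 bytes in UTF-16 and 4 in UTF-32, which is where the "around half" figure comes from. The emoji takes 4 bytes in both.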


Let us see how the character ‘A’ will be represented in UTF-16.

When UTF-16 was announced, various companies started implementing it in their own ways. So the letter ‘A’ (code point U+0041) can be written in either of these ways in the 2-byte format: as the bytes 00 41, or as the bytes 41 00.

The second format could be a bit more efficient in particular scenarios. The first format, with the most significant byte first, is known as Big Endian, and the second, with the least significant byte first, is known as Little Endian.
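The two byte orders can be checked directly in Python, whose standard codecs `utf-16-be` and `utf-16-le` encode without adding a BOM. This is only a quick sketch; the variable names are arbitrary.

```python
# Encode the same character in both byte orders (no BOM is added
# when the byte order is named explicitly in the codec).
big = "A".encode("utf-16-be")
little = "A".encode("utf-16-le")

print(big.hex(" "))     # 00 41
print(little.hex(" "))  # 41 00
```

The same two bytes appear in both cases; only their order differs.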

A visual example: The word “Example” in different encodings (UTF-16 with BOM):
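The byte sequences for “Example” can be reproduced with a short Python sketch. It assumes the BOM is simply prepended to the encoded text, using the constants `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` from the standard library.

```python
import codecs

text = "Example"

# BOM followed by the text in the matching byte order
utf16_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
utf16_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print("UTF-16BE:", utf16_be.hex(" ").upper())
# UTF-16BE: FE FF 00 45 00 78 00 61 00 6D 00 70 00 6C 00 65
print("UTF-16LE:", utf16_le.hex(" ").upper())
# UTF-16LE: FF FE 45 00 78 00 61 00 6D 00 70 00 6C 00 65 00
```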

Now, if the computer knows that the given data is in UTF-16 format, how will it decide whether it is Little Endian or Big Endian?

To solve this, an additional piece of non-character data, the BOM (byte order mark), was introduced. So, if the first two bytes of a UTF-16 encoded text file are FE, FF, the encoding is UTF-16BE. If they are FF, FE, it is UTF-16LE.
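The BOM check described above can be sketched in a few lines of Python. The function name `detect_utf16_byte_order` is hypothetical, and the sketch assumes the data is already known to be UTF-16, so it does not try to tell UTF-16 apart from other encodings.

```python
import codecs
from typing import Optional


def detect_utf16_byte_order(data: bytes) -> Optional[str]:
    """Return the codec name implied by the BOM, or None if there is no BOM.

    Assumes the caller already knows the data is UTF-16.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    return None


print(detect_utf16_byte_order(b"\xfe\xff\x00\x41"))  # utf-16-be
print(detect_utf16_byte_order(b"\xff\xfe\x41\x00"))  # utf-16-le
print(detect_utf16_byte_order(b"\x00\x41"))          # None
```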

 

The BOM is not compulsory. If it is not found, software will assume some format and try parsing the data. If parsing fails, it will try the other format.
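The try-one-then-the-other approach might look roughly like this in Python. The function name `decode_utf16_without_bom` and its `first_guess` parameter are invented for this sketch, and real software may use different or additional heuristics.

```python
def decode_utf16_without_bom(data: bytes, first_guess: str = "utf-16-be") -> str:
    """Decode BOM-less UTF-16 by trying one byte order, then the other."""
    other = "utf-16-le" if first_guess == "utf-16-be" else "utf-16-be"
    try:
        return data.decode(first_guess)
    except UnicodeDecodeError:
        # The first guess produced an invalid sequence, so try the other order.
        return data.decode(other)


# Little Endian bytes for "Øa": D8 00 61 00. Read as Big Endian, the first
# unit is D800, an unpaired surrogate, so the first attempt fails.
data = "\u00d8a".encode("utf-16-le")
print(decode_utf16_without_bom(data) == "\u00d8a")  # True
```

One caveat with this approach: many byte sequences parse without error in both byte orders. For example, the Little Endian bytes 41 00 for ‘A’ also decode as Big Endian, giving the unrelated character U+4100, so a successful parse does not guarantee that the guess was right.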

Uday Ogra

Connect with me at http://facebook.com/tendulkarogra and let's have some healthy discussion :)
