How are characters represented in UTF-16 format
UTF-16 is a variable-length encoding that uses either 2 bytes or 4 bytes to represent a character. Compared with UTF-32, which always uses 4 bytes per character, files are around half the size when the text consists mostly of characters that fit in 2 bytes.
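To make the 2-byte versus 4-byte distinction concrete, here is a minimal Python sketch that compares encoded sizes. The helper name encoded_sizes and the sample characters are chosen only for illustration; the codec names are the standard ones shipped with Python.

```python
def encoded_sizes(ch):
    """Return (UTF-16 size, UTF-32 size) in bytes for one character, without a BOM."""
    return len(ch.encode("utf-16-be")), len(ch.encode("utf-32-be"))


# 'A' and the euro sign fit in one 16-bit unit; the emoji needs two (a surrogate pair).
for ch in ["A", "\u20ac", "\U0001F600"]:
    utf16_size, utf32_size = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-16 = {utf16_size} bytes, UTF-32 = {utf32_size} bytes")
# U+0041: UTF-16 = 2 bytes, UTF-32 = 4 bytes
# U+20AC: UTF-16 = 2 bytes, UTF-32 = 4 bytes
# U+1F600: UTF-16 = 4 bytes, UTF-32 = 4 bytes
```

The "half the size" saving therefore applies to text made up of 2-byte characters; characters outside that range take 4 bytes in both encodings.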
Let us see how the character ‘A’ is represented in UTF-16.
When UTF-16 was announced, various companies started implementing it in their own ways. So the letter ‘A’ (code point U+0041) can be written in either of two ways in the 2-byte format: as the bytes 00 41 or as the bytes 41 00.
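The second format could be a bit more efficient in particular scenarios. The first format, which stores the most significant byte first, is known as Big Endian, and the second format, which stores the least significant byte first, is known as Little Endian.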
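The two byte orders can be checked directly in Python. This is a small sketch; the helper name utf16_hex is hypothetical, while "utf-16-be" and "utf-16-le" are Python's built-in codecs that write no BOM.

```python
def utf16_hex(text, byte_order):
    """Encode text as UTF-16 in the given byte order ("big" or "little"), as hex."""
    codec = "utf-16-be" if byte_order == "big" else "utf-16-le"
    return text.encode(codec).hex(" ").upper()


print(utf16_hex("A", "big"))     # 00 41
print(utf16_hex("A", "little"))  # 41 00
```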
A visual example: The word “Example” in different encodings (UTF-16 with BOM):
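The bytes for this word can be reproduced with a short Python sketch. The function name utf16_with_bom is an illustrative assumption; codecs.BOM_UTF16_BE and codecs.BOM_UTF16_LE are constants from Python's standard library.

```python
import codecs


def utf16_with_bom(text, byte_order):
    """Return text encoded as UTF-16 in the given byte order, prefixed with its BOM."""
    if byte_order == "big":
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


print(utf16_with_bom("Example", "big").hex(" ").upper())
# FE FF 00 45 00 78 00 61 00 6D 00 70 00 6C 00 65
print(utf16_with_bom("Example", "little").hex(" ").upper())
# FF FE 45 00 78 00 61 00 6D 00 70 00 6C 00 65 00
```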
Now, if the computer knows that the given data is in UTF-16 format, how will it decide whether it is Little Endian or Big Endian?
To solve this, an additional piece of data that is not part of the text, the BOM (byte order mark), was introduced. If the first two bytes of a UTF-16 encoded text file are FE FF, the encoding is UTF-16BE. If they are FF FE, it is UTF-16LE.
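A BOM check of this kind might look like the following Python sketch. The function name detect_utf16_byte_order is hypothetical, and the sketch assumes the data is already known to be UTF-16, since a UTF-32LE file also begins with FF FE.

```python
import codecs


def detect_utf16_byte_order(data):
    """Return "UTF-16BE", "UTF-16LE", or None when no BOM is present.

    Assumes the caller already knows the data is UTF-16.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "UTF-16BE"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "UTF-16LE"
    return None


print(detect_utf16_byte_order(b"\xfe\xff\x00\x41"))  # UTF-16BE
print(detect_utf16_byte_order(b"\xff\xfe\x41\x00"))  # UTF-16LE
print(detect_utf16_byte_order(b"\x00\x41"))          # None
```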
The BOM is not compulsory. If it is not found, software will assume one byte order and try parsing the data. If parsing fails, it will try the other byte order.
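The try-one-then-the-other fallback could be sketched in Python as follows. The function name decode_utf16_guess and the choice to try Little Endian first are assumptions made for illustration; real software may use a different default or smarter heuristics. The sketch also assumes the input carries no BOM.

```python
def decode_utf16_guess(data, first="utf-16-le", second="utf-16-be"):
    """Decode BOM-less UTF-16 data, trying one byte order and then the other.

    Returns (text, codec_used). Raises ValueError if neither order decodes.
    """
    for codec in (first, second):
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            continue  # this byte order produced invalid UTF-16; try the next
    raise ValueError("data is not valid UTF-16 in either byte order")


# The bytes 00 D8 00 41 are invalid as Little Endian (an unpaired surrogate),
# so the function falls back to Big Endian.
text, codec = decode_utf16_guess(b"\x00\xd8\x00\x41")
print(codec)  # utf-16-be
```

A successful decode does not prove the guess was right: the Big Endian bytes 00 41 for ‘A’ also decode without error as Little Endian, giving the unrelated character U+4100. This is why a BOM, when present, is the more reliable signal.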