UTF Explained: Decoding the Secrets of Unicode and Character Encoding

by marjavamitjava · February 4, 2024

How is the euro sign (€) represented in UTF-8, and how is its binary form converted to the UTF formats?

Sure, let’s take the example of the Euro sign, which has the Unicode code point U+20AC. In decimal, that code point is 8364.

UTF-8 : The Euro sign (U+20AC) falls into the range of characters that require 3 bytes in UTF-8 encoding.

In binary, the Unicode code point is 00100000 10101100. To represent this code point in UTF-8 with 3 bytes, we use the following format:

For a 3-byte character in UTF-8, the format is 1110xxxx:10xxxxxx:10xxxxxx.

Now, let’s map the bits from the Unicode code point to the format. The 16 bits of the code point are split into groups of 4, 6, and 6 bits: 0010 000010 101100.

1110: The first 4 bits of the first byte are the fixed marker for a 3-byte sequence.
0010: The remaining 4 bits of the first byte are the top 4 bits of the code point.
10: The first 2 bits of the second byte are the fixed continuation marker.
000010: The remaining 6 bits of the second byte are the next 6 bits of the code point.
10: The first 2 bits of the third byte are the fixed continuation marker.
101100: The remaining 6 bits of the third byte are the last 6 bits of the code point.

Putting it all together, the UTF-8 representation of U+20AC is 11100010:10000010:10101100. In hexadecimal, that is E2 82 AC.
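The bit mapping above can be reproduced with a few shifts and masks. The following is a minimal sketch, not a full encoder: the class name EuroUtf8ByHand and its helper methods are made up for illustration, and the method only covers the 3-byte range. It compares the hand-built bytes with what Java's own encoder produces.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EuroUtf8ByHand {

    // Encodes a code point in the range U+0800..U+FFFF into the 3-byte UTF-8 form.
    // Surrogate code points (U+D800..U+DFFF) are not rejected here, for brevity.
    static byte[] encodeThreeByte(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not in the 3-byte range");
        }
        byte b1 = (byte) (0xE0 | (codePoint >> 12));          // 1110xxxx: top 4 bits
        byte b2 = (byte) (0x80 | ((codePoint >> 6) & 0x3F));  // 10xxxxxx: middle 6 bits
        byte b3 = (byte) (0x80 | (codePoint & 0x3F));         // 10xxxxxx: low 6 bits
        return new byte[] { b1, b2, b3 };
    }

    // Formats bytes as space-separated uppercase hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] manual = encodeThreeByte(0x20AC);
        byte[] library = "\u20AC".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(manual));                   // E2 82 AC
        System.out.println(Arrays.equals(manual, library));  // true
    }
}
```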

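Why do we need 3 bytes and not 2 or 4?

The Euro sign (U+20AC) needs 3 bytes because of how UTF-8 allots payload bits. The 2-byte form carries only 11 bits, so it tops out at U+07FF, while U+20AC needs 14 significant bits. The 3-byte form carries 16 bits, which is enough. A 4-byte form would also have room, but UTF-8 requires the shortest possible encoding, so padding the value into 4 bytes is invalid.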

UTF-8 is a variable-width encoding, which means that different characters require a different number of bytes for representation. The Unicode code points are divided into different ranges, and each range corresponds to a specific number of bytes in the UTF-8 encoding. Here’s a simplified breakdown:

ASCII / Basic Latin characters (U+0000 to U+007F): 1 byte
Characters from U+0080 to U+07FF (Latin-1 Supplement, Greek, Cyrillic, Hebrew, Arabic, and others): 2 bytes
Remaining Basic Multilingual Plane characters (U+0800 to U+FFFF): 3 bytes
Supplementary Planes characters (U+10000 to U+10FFFF): 4 bytes

The 2-byte format is 110xxxxx:10xxxxxx, and the 4-byte format is 11110xxx:10xxxxxx:10xxxxxx:10xxxxxx.
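The range table translates directly into a small lookup. This is a sketch with a hypothetical class and method name (Utf8Length, utf8Length); it only reports the byte count and does not encode anything.

```java
public class Utf8Length {

    // Returns how many bytes UTF-8 needs for the given Unicode code point.
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point");
        }
        if (codePoint <= 0x7F) {
            return 1;   // 0xxxxxxx: 7 payload bits
        }
        if (codePoint <= 0x7FF) {
            return 2;   // 110xxxxx 10xxxxxx: 11 payload bits
        }
        if (codePoint <= 0xFFFF) {
            return 3;   // 1110xxxx 10xxxxxx 10xxxxxx: 16 payload bits
        }
        return 4;       // 11110xxx plus three continuation bytes: 21 payload bits
    }

    public static void main(String[] args) {
        System.out.println(utf8Length('A'));      // 1
        System.out.println(utf8Length(0x00E9));   // 2 (é)
        System.out.println(utf8Length(0x20AC));   // 3 (€)
        System.out.println(utf8Length(0x1F600));  // 4 (a code point in a supplementary plane)
    }
}
```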

So if a byte starts with 11110, can we assume it begins a 4-byte character?

Yes, that’s correct. In valid UTF-8, a byte that matches the bit pattern 11110xxx is always the first byte of a 4-byte character. It signals the start of a 4-byte sequence.

Can’t a 1-byte character also begin with 11110?

No, in the standard UTF-8 encoding, a 1-byte character does not begin with the bit pattern 11110xxx. The 11110 pattern is specifically reserved for signaling the start of a 4-byte character sequence.

A 1-byte character in UTF-8 will begin with a different pattern. In UTF-8:

A 1-byte character has the format 0xxxxxxx.
A 2-byte character has the format 110xxxxx:10xxxxxx.
A 3-byte character has the format 1110xxxx:10xxxxxx:10xxxxxx.
A 4-byte character has the format 11110xxx:10xxxxxx:10xxxxxx:10xxxxxx.

Each of these patterns serves as a marker to indicate the number of bytes in the character sequence and how subsequent bytes are used.

So, if you see the 11110 pattern at the start of a byte in UTF-8, you can confidently conclude that it is the beginning of a 4-byte character.
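Because the patterns do not overlap, a decoder can classify any byte by masking its leading bits. The sketch below uses a made-up class name (Utf8LeadByte) and only inspects the bit pattern; a full validator would also reject overlong forms and values above U+10FFFF.

```java
public class Utf8LeadByte {

    // Returns the sequence length announced by a lead byte,
    // 0 for a continuation byte (10xxxxxx), or -1 for a byte whose
    // pattern (11111xxx) never starts a sequence in standard UTF-8.
    static int sequenceLength(byte b) {
        int v = b & 0xFF;                            // view the signed Java byte as 0..255
        if ((v & 0x80) == 0x00) { return 1; }        // 0xxxxxxx
        if ((v & 0xC0) == 0x80) { return 0; }        // 10xxxxxx
        if ((v & 0xE0) == 0xC0) { return 2; }        // 110xxxxx
        if ((v & 0xF0) == 0xE0) { return 3; }        // 1110xxxx
        if ((v & 0xF8) == 0xF0) { return 4; }        // 11110xxx
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0x61)); // 1 ('a')
        System.out.println(sequenceLength((byte) 0xC3)); // 2 (lead byte of é)
        System.out.println(sequenceLength((byte) 0xE2)); // 3 (lead byte of €)
        System.out.println(sequenceLength((byte) 0xF0)); // 4
        System.out.println(sequenceLength((byte) 0x82)); // 0 (continuation byte)
    }
}
```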

Represent the same euro sign in UTF-16

The Euro sign (€) has the Unicode code point U+20AC. In UTF-16 encoding, this code point is represented using a single 16-bit code unit because it falls within the Basic Multilingual Plane (BMP). The UTF-16 representation of U+20AC is therefore 0x20AC.

Let’s break down the binary representation:

U+20AC in hexadecimal is 20AC.
In binary, 20 is 0010 0000 and AC is 1010 1100.
Concatenate them to get 0010 0000 1010 1100.
In UTF-16, the character is represented as a 16-bit code unit. So, if it fits within the BMP, it is simply 0010 0000 1010 1100.

In summary, the Euro sign (€) in UTF-16 is represented by the 16-bit code unit 0010 0000 1010 1100.
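Java strings are stored as UTF-16 code units, so this can be checked without any manual encoding. A small sketch follows; the class name EuroUtf16 and the helper toBinary16 are illustrative only.

```java
import java.nio.charset.StandardCharsets;

public class EuroUtf16 {

    // Formats a UTF-16 code unit as 16 binary digits with leading zeros.
    static String toBinary16(char c) {
        return String.format("%16s", Integer.toBinaryString(c)).replace(' ', '0');
    }

    public static void main(String[] args) {
        String euro = "\u20AC";

        // One code unit is enough because U+20AC lies within the BMP.
        System.out.println(euro.length());               // 1
        System.out.println(toBinary16(euro.charAt(0)));  // 0010000010101100

        // UTF_16BE writes the code unit high byte first and adds no byte order mark.
        byte[] bytes = euro.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(String.format("%02X %02X", bytes[0] & 0xFF, bytes[1] & 0xFF)); // 20 AC
    }
}
```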

Are there any characters that take 2 bytes in UTF-16 but 3 in UTF-8?

Yes, there are characters that require 2 bytes in UTF-16 encoding but 3 bytes in UTF-8 encoding. These characters typically fall in the range U+0800 to U+FFFF, where UTF-16 uses a single 16-bit code unit, but UTF-8 uses three bytes for representation.

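An example of such a character is “CJK UNIFIED IDEOGRAPH-4E2D” (中) with the Unicode code point U+4E2D. Let’s break down the encoding:

In UTF-16:

The UTF-16 encoding for U+4E2D is a single 16-bit code unit: 4E2D.

In UTF-8:

The binary representation of U+4E2D is 0100 1110 0010 1101.
The UTF-8 encoding is 11100100:10111000:10101101 (E4 B8 AD).

By contrast, a character below U+0800 needs only 2 bytes in both encodings. For example, “CYRILLIC SMALL LETTER SCHWA” (ә), U+04D9, is 04D9 in UTF-16 and 11010011:10011001 (D3 99) in UTF-8.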
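The point where UTF-8 moves from 2 to 3 bytes while UTF-16 stays at 2 can be probed directly around U+0800. This is a sketch with hypothetical names (BoundaryCheck, utf8Bytes, utf16Bytes); it relies on Java's built-in encoder and on Character.charCount for the number of UTF-16 code units.

```java
import java.nio.charset.StandardCharsets;

public class BoundaryCheck {

    // Number of bytes the code point occupies in UTF-8, measured with Java's encoder.
    static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes in UTF-16: 2 per code unit, 1 or 2 code units per code point.
    static int utf16Bytes(int codePoint) {
        return Character.charCount(codePoint) * 2;
    }

    public static void main(String[] args) {
        int[] samples = { 0x07FF, 0x0800, 0x20AC, 0xFFFD };
        for (int cp : samples) {
            System.out.println(String.format("U+%04X: UTF-8 = %d bytes, UTF-16 = %d bytes",
                    cp, utf8Bytes(cp), utf16Bytes(cp)));
        }
        // U+07FF: UTF-8 = 2 bytes, UTF-16 = 2 bytes
        // U+0800: UTF-8 = 3 bytes, UTF-16 = 2 bytes
        // U+20AC: UTF-8 = 3 bytes, UTF-16 = 2 bytes
        // U+FFFD: UTF-8 = 3 bytes, UTF-16 = 2 bytes
    }
}
```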

Pros and cons of UTF-8 and UTF-16?

Memory Usage:

UTF-8: Variable-length encoding, requiring 1-4 bytes per character. Smaller for ASCII-heavy text (1 byte per character), but larger for CJK characters, which take 3 bytes instead of the 2 that UTF-16 uses. Characters outside the BMP, such as most emoji, take 4 bytes in both encodings.
UTF-16: Also a variable-length encoding: 2 bytes for every BMP character and 4 bytes (a surrogate pair) for characters outside the BMP. It doubles the size of ASCII text but is compact for CJK.

Processing:

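UTF-8: Variable-length encoding means the n-th character cannot be located by arithmetic alone; the decoder has to scan bytes and determine each character's length. This makes operations such as indexing and substring extraction more complex and potentially slower.
UTF-16: Simpler when text stays within the BMP, because each character is then exactly one 16-bit code unit and indexing by code unit works. Text containing surrogate pairs needs the same kind of scanning as UTF-8, and the byte order (big-endian or little-endian) has to be agreed on or signaled with a BOM.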

Storage:

UTF-8: Generally smaller for text dominated by ASCII, including markup and source code, but larger for CJK-heavy content.
UTF-16: Can be smaller for CJK-heavy content, but larger for ASCII-heavy or mixed content, since every ASCII character takes 2 bytes.
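These size trade-offs can be measured rather than estimated. The sketch below (class and method names are illustrative) encodes three short samples both ways, and also prints String.length() next to codePointCount to show how a character outside the BMP occupies two UTF-16 code units.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {

    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF_16BE is used so that no byte order mark is added to the count.
    static int utf16Size(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String cjk = "\u4E2D\u6587";                            // two BMP ideographs
        String emoji = new String(Character.toChars(0x1F600));  // one code point outside the BMP

        System.out.println(utf8Size(ascii) + " vs " + utf16Size(ascii)); // 5 vs 10
        System.out.println(utf8Size(cjk) + " vs " + utf16Size(cjk));     // 6 vs 4
        System.out.println(utf8Size(emoji) + " vs " + utf16Size(emoji)); // 4 vs 4

        // One code point, but two UTF-16 code units (a surrogate pair).
        System.out.println(emoji.length());                              // 2
        System.out.println(emoji.codePointCount(0, emoji.length()));     // 1
    }
}
```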

Java code to demonstrate the complexity?

In the following code we try to find the 5th character of the string cafénoir, which should be ‘n’. Byte-wise, c, a, and f consume 3 bytes in total, but é consumes 2 bytes, so the 5th character starts at the 6th byte, not the 5th. The code handles this by counting only bytes that are not continuation bytes: if ((utf8Bytes[byteIndex] & 0xC0) != 0x80)

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class UTF8RandomAccessExample {

    public static void main(String[] args) {
        // UTF-8 encoded "cafénoir" (the charset is named explicitly so the platform default is not used)
        byte[] utf8Bytes = "cafénoir".getBytes(StandardCharsets.UTF_8);

        // Random access to the 5th character (zero-based index 4)
        int index = 4;
        int byteIndex = findByteIndexForCharacterIndex(utf8Bytes, index);
        byte[] characterBytes = extractCharacterBytes(utf8Bytes, byteIndex);

        // Decode the character back to a UTF-16 Java String
        String character = new String(characterBytes, StandardCharsets.UTF_8);
        System.out.println("Character at index " + index + ": " + character); // Output: "n"
    }

    // Returns the byte offset at which the character with the given zero-based index starts
    private static int findByteIndexForCharacterIndex(byte[] utf8Bytes, int characterIndex) {
        int currentCharacterIndex = -1;
        for (int byteIndex = 0; byteIndex < utf8Bytes.length; byteIndex++) {
            // Every byte that is not a continuation byte (10xxxxxx) starts a new character
            if ((utf8Bytes[byteIndex] & 0xC0) != 0x80) {
                currentCharacterIndex++;
                if (currentCharacterIndex == characterIndex) {
                    return byteIndex;
                }
            }
        }
        throw new IndexOutOfBoundsException("No character at index " + characterIndex);
    }

    // Copies the lead byte at byteIndex plus any continuation bytes that follow it
    private static byte[] extractCharacterBytes(byte[] utf8Bytes, int byteIndex) {
        int start = byteIndex;
        // Skip the lead byte, then find the end of the character (the next byte that does not start with 10).
        // If the string were "caféénoir", the 5th character would be é, and end would become start + 2 because é needs 2 bytes.
        int end = start + 1;
        while (end < utf8Bytes.length && (utf8Bytes[end] & 0xC0) == 0x80) {
            end++;
        }
        return Arrays.copyOfRange(utf8Bytes, start, end);
    }
}

getBytes (UTF-8) of café returns [99, 97, 102, -61, -87]. What does that signify?

Let’s break down each byte:

99 corresponds to the ASCII code of the character ‘c’.
97 corresponds to the ASCII code of the character ‘a’.
102 corresponds to the ASCII code of the character ‘f’.
-61 and -87 together represent the UTF-8 encoding of the character ‘é’. In binary, -61 is 11000011, and -87 is 10101001. Combining them, you get the UTF-8 encoding 11000011:10101001, which represents the character ‘é’.

So, the entire byte array [99, 97, 102, -61, -87] is the UTF-8 representation of the string “café”.
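The negative numbers appear because Java's byte type is signed (-128 to 127), so any byte with its top bit set prints as a negative value. Masking with 0xFF recovers the unsigned value: -61 + 256 = 195 = 0xC3 and -87 + 256 = 169 = 0xA9. A short sketch, with an illustrative class name and helper:

```java
import java.nio.charset.StandardCharsets;

public class SignedBytes {

    // Formats a byte as 8 binary digits, treating it as an unsigned value.
    static String toBinary8(byte b) {
        return String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0');
    }

    public static void main(String[] args) {
        byte[] bytes = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
            // Columns: signed value as Java prints it, unsigned hex, binary.
            System.out.println(String.format("%4d  0x%02X  %s", b, b & 0xFF, toBinary8(b)));
        }
        //   99  0x63  01100011
        //   97  0x61  01100001
        //  102  0x66  01100110
        //  -61  0xC3  11000011
        //  -87  0xA9  10101001
    }
}
```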

In UTF-16 it comes out as [-2, -1, 0, 99, 0, 97, 0, 102, 0, -23]

The array [-2, -1, 0, 99, 0, 97, 0, 102, 0, -23] in Java represents the UTF-16 encoding of the string “caf\u00E9”. In UTF-16, characters are represented by one or two 16-bit code units.

Let’s break down each pair of bytes:

-2, -1 represents the Byte Order Mark (BOM) for UTF-16. It indicates the byte order (little-endian or big-endian). In this case, -2 corresponds to 0xFE and -1 corresponds to 0xFF; the sequence FE FF indicates big-endian byte order (FF FE would indicate little-endian). The BOM is optional; when present, a decoder uses it to determine the byte order.
0, 99 represents the UTF-16 encoding of the character ‘c’.
0, 97 represents the UTF-16 encoding of the character ‘a’.
0, 102 represents the UTF-16 encoding of the character ‘f’.
0, -23 represents the UTF-16 encoding of the character ‘é’. In binary, -23 is 11101001. Combining 0 and -23 with the high byte first, you get the UTF-16 code unit 00000000:11101001 (0x00E9), which represents the character ‘é’.

So, the entire array [-2, -1, 0, 99, 0, 97, 0, 102, 0, -23] represents the UTF-16 big-endian encoding of the string “caf\u00E9”, where the first two bytes are the BOM, and the subsequent pairs of bytes represent the UTF-16 encoding of each character.
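Java exposes three UTF-16 charsets, and comparing their output makes the role of the BOM and the byte order visible. In this sketch the class name Utf16ByteOrder and the helper encode are illustrative; the StandardCharsets constants are the standard ones.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16ByteOrder {

    // Encodes the text with the given charset and renders the signed byte values.
    static String encode(String text, Charset charset) {
        return Arrays.toString(text.getBytes(charset));
    }

    public static void main(String[] args) {
        String s = "caf\u00E9";

        // UTF_16: Java writes a BOM (FE FF) followed by big-endian code units.
        System.out.println(encode(s, StandardCharsets.UTF_16));
        // [-2, -1, 0, 99, 0, 97, 0, 102, 0, -23]

        // UTF_16BE: same byte order, no BOM.
        System.out.println(encode(s, StandardCharsets.UTF_16BE));
        // [0, 99, 0, 97, 0, 102, 0, -23]

        // UTF_16LE: low byte first, no BOM.
        System.out.println(encode(s, StandardCharsets.UTF_16LE));
        // [99, 0, 97, 0, 102, 0, -23, 0]
    }
}
```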

When we copy a character, what gets copied? Say I copy ‘a’ from this website: is it the binary format?

When you copy a character like ‘a’ from a website, what gets copied is the character as text, not the raw bytes the web server sent. Here’s a breakdown of the process:

1. Character Encoding:

Websites typically use UTF-8 encoding, capable of representing a wide range of characters.
In UTF-8, the character ‘a’ (lowercase Latin a) is encoded as a single byte: 01100001 (binary).

2. Copy Operation:

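When you hit “copy,” the browser performs the following:
- Takes the selected text from the page content, which it already decoded from the site’s bytes (UTF-8 in this case) into Unicode characters when it loaded the page.
- Places the text on the operating system’s clipboard, often in more than one format (for example plain text and HTML). The byte encoding of the plain-text format is defined by the platform rather than by the website; Windows, for instance, stores Unicode text as UTF-16.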

3. Paste Operation:

When you paste, the receiving application reads the text from the clipboard and converts it to its own internal encoding:
- If the application handles Unicode (e.g., most modern text editors or word processors), it displays the character ‘a’.
- If the application, or the file it saves to, uses an encoding that cannot represent the character, or if the bytes are later read back with a different encoding, it might display a different character or a placeholder symbol.
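The mismatch case can be reproduced in a few lines by writing bytes in one encoding and reading them back as another. This sketch does not touch the clipboard at all; it only illustrates the encoding mismatch, and the class and method names are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongDecoding {

    // Encodes the text as UTF-8, then decodes those bytes with another charset.
    static String utf8ReadAs(String text, Charset readAs) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        // 'a' is the single byte 0x61 in both encodings, so it survives the round trip.
        System.out.println(utf8ReadAs("a", StandardCharsets.ISO_8859_1).equals("a")); // true

        // é is C3 A9 in UTF-8; read as ISO-8859-1 it turns into two characters, U+00C3 and U+00A9.
        String garbled = utf8ReadAs("\u00E9", StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());                // 2
        System.out.println(garbled.equals("\u00C3\u00A9"));  // true
    }
}
```

ASCII characters are safe in this scenario because UTF-8 and ISO-8859-1 agree on the first 128 values; the damage only shows up for characters beyond that range.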
