UTF Explained: Decoding the Secrets of Unicode and Character Encoding
Represent the euro sign (€) in UTF-8 and explain how its binary form is converted to the UTF formats?
Sure, let’s take the example of the Euro sign, which has the Unicode code point U+20AC. In decimal, this code point is 8364.
UTF-8: The Euro sign (U+20AC) falls into the range of characters that require 3 bytes in UTF-8 encoding.
In binary, the Unicode code point is 00100000 10101100. To represent this code point in UTF-8 with 3 bytes, we use the following format:
For a 3-byte character in UTF-8, the format is 1110xxxx:10xxxxxx:10xxxxxx.
Now, let’s map the bits from the Unicode code point to the format:
- 1110: The first 4 bits of the first byte are the fixed marker for a 3-byte sequence. They do not come from the code point.
- 0010: The remaining 4 bits of the first byte are the top 4 bits of the code point.
- 10: The first 2 bits of the second byte are the fixed continuation marker.
- 000010: The remaining 6 bits of the second byte are the next 6 bits of the code point.
- 10: The first 2 bits of the third byte are the fixed continuation marker.
- 101100: The remaining 6 bits of the third byte are the last 6 bits of the code point.
Putting it all together, the correct UTF-8 representation for U+20AC is 11100010:10000010:10101100. In hexadecimal, that is E2 82 AC.
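The bit mapping above can be sketched in Java as a minimal illustration. The class and method names (EuroUtf8Sketch, encodeThreeByte) are made up for this example, and the method only handles the 3-byte range without checking for surrogate code points.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EuroUtf8Sketch {
    // Hypothetical helper: packs a code point in U+0800..U+FFFF into 1110xxxx 10xxxxxx 10xxxxxx.
    static byte[] encodeThreeByte(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not in the 3-byte range");
        }
        byte b1 = (byte) (0xE0 | (codePoint >> 12));         // marker 1110 + top 4 bits
        byte b2 = (byte) (0x80 | ((codePoint >> 6) & 0x3F)); // marker 10 + middle 6 bits
        byte b3 = (byte) (0x80 | (codePoint & 0x3F));        // marker 10 + low 6 bits
        return new byte[] {b1, b2, b3};
    }

    public static void main(String[] args) {
        byte[] manual = encodeThreeByte(0x20AC);
        byte[] library = "\u20AC".getBytes(StandardCharsets.UTF_8);
        for (byte b : manual) {
            System.out.printf("%02X ", b & 0xFF); // mask to print the byte as unsigned
        }
        System.out.println();
        System.out.println(Arrays.equals(manual, library)); // compares with Java's own encoder
    }
}
```

Running it should print E2 82 AC, followed by true for the comparison with the library encoder.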
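Why do we need 3 bytes and not 2 or 4?
The number of bytes is fixed by the design of the UTF-8 encoding scheme. A 2-byte sequence (110xxxxx:10xxxxxx) carries only 11 bits of the code point, so it can reach no further than U+07FF. U+20AC needs 14 bits, which fit in the 16 bits carried by a 3-byte sequence. UTF-8 always uses the shortest form that fits, so 4 bytes are never used for it.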
UTF-8 is a variable-width encoding, which means that different characters require a different number of bytes for representation. The Unicode code points are divided into different ranges, and each range corresponds to a specific number of bytes in the UTF-8 encoding. Here’s a simplified breakdown:
- U+0000 to U+007F (Basic Latin, i.e. ASCII): 1 byte
- U+0080 to U+07FF (Latin-1 Supplement and other scripts such as Greek, Cyrillic, Hebrew and Arabic): 2 bytes
- U+0800 to U+FFFF (the rest of the Basic Multilingual Plane): 3 bytes
- U+10000 to U+10FFFF (supplementary planes): 4 bytes
For 2 bytes the format is 110xxxxx:10xxxxxx, and for 4 bytes it is 11110xxx:10xxxxxx:10xxxxxx:10xxxxxx.
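As a rough sketch of these ranges, the byte count can be computed directly from the code point and compared with what Java's encoder produces. The name utf8Length is an assumption made for this example, not a library method.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthSketch {
    // Hypothetical helper: number of UTF-8 bytes needed for a code point, by range.
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + codePoint);
        }
        if (codePoint <= 0x7F) return 1;   // 7 payload bits
        if (codePoint <= 0x7FF) return 2;  // 11 payload bits
        if (codePoint <= 0xFFFF) return 3; // 16 payload bits
        return 4;                          // 21 payload bits
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600}; // sample code points chosen for illustration
        for (int cp : samples) {
            int actual = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8).length;
            System.out.printf("U+%04X: computed %d, library %d%n", cp, utf8Length(cp), actual);
        }
    }
}
```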
So if we see 11110 at the start of a byte, can we assume it is a 4-byte character?
Yes, that’s correct. In UTF-8, if you encounter a byte that starts with the bit pattern 11110xxx, you can safely assume that it is the first byte of a 4-byte character. This byte signals the start of a 4-byte sequence in UTF-8 encoding.
Can’t we have a 1-byte character which also begins with 11110?
No, in the standard UTF-8 encoding, a 1-byte character does not begin with the bit pattern 11110xxx. The 11110 pattern is specifically reserved for signaling the start of a 4-byte character sequence.
A 1-byte character in UTF-8 will begin with a different pattern. In UTF-8:
- A 1-byte character has the format 0xxxxxxx.
- A 2-byte character has the format 110xxxxx:10xxxxxx.
- A 3-byte character has the format 1110xxxx:10xxxxxx:10xxxxxx.
- A 4-byte character has the format 11110xxx:10xxxxxx:10xxxxxx:10xxxxxx.
Each of these patterns serves as a marker to indicate the number of bytes in the character sequence and how subsequent bytes are used.
So, if you see the 11110 pattern at the start of a byte in UTF-8, you can confidently conclude that it is the beginning of a 4-byte character.
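A decoder can apply these markers with a few bit masks. The following is a minimal sketch, assuming a made-up helper named sequenceLength that returns 0 for a byte that cannot start a character. It does not validate the bytes that follow.

```java
class Utf8LeadByteSketch {
    // Hypothetical helper: length of the sequence announced by a lead byte, or 0 if the
    // byte is a continuation byte (10xxxxxx) or matches none of the four patterns.
    static int sequenceLength(byte b) {
        int v = b & 0xFF;                    // treat the signed Java byte as 0..255
        if ((v & 0x80) == 0x00) return 1;    // 0xxxxxxx
        if ((v & 0xE0) == 0xC0) return 2;    // 110xxxxx
        if ((v & 0xF0) == 0xE0) return 3;    // 1110xxxx
        if ((v & 0xF8) == 0xF0) return 4;    // 11110xxx
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0x61)); // 'a'
        System.out.println(sequenceLength((byte) 0xE2)); // first byte of the Euro sign
        System.out.println(sequenceLength((byte) 0xF0)); // a 4-byte lead byte
        System.out.println(sequenceLength((byte) 0x82)); // a continuation byte
    }
}
```

The four calls print 1, 3, 4 and 0 respectively.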
Represent the same euro sign in UTF-16
The Euro sign (€) has the Unicode code point U+20AC. In UTF-16 encoding, this code point is represented using a single 16-bit code unit because it falls within the Basic Multilingual Plane (BMP). The UTF-16 representation of U+20AC is therefore 0x20AC.
Let’s break down the binary representation:
- U+20AC in hexadecimal is 20AC.
- In binary, 20 is 0010 0000 and AC is 1010 1100.
- Concatenate them to get 0010 0000 1010 1100.
- In UTF-16, the character is represented as a 16-bit code unit. So, if it fits within the BMP, it is simply 0010 0000 1010 1100.
In summary, the Euro sign (€) in UTF-16 is represented by the 16-bit code unit 0010 0000 1010 1100.
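To see the code units directly, here is a small sketch built on Character.toChars, which returns the UTF-16 code units for a code point. The helper name codeUnitsHex is hypothetical, and U+1F600 is added only as an illustrative character outside the BMP, which needs a surrogate pair.

```java
class Utf16UnitsSketch {
    // Hypothetical helper: the UTF-16 code units of a code point, as space-separated hex.
    static String codeUnitsHex(int codePoint) {
        char[] units = Character.toChars(codePoint); // 1 unit inside the BMP, 2 outside it
        StringBuilder sb = new StringBuilder();
        for (char unit : units) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(codeUnitsHex(0x20AC));  // 20AC
        System.out.println(codeUnitsHex(0x1F600)); // D83D DE00
    }
}
```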
Are there any characters which take 2 bytes in UTF-16 but 3 in UTF-8?
Yes, there are characters that require 2 bytes in UTF-16 encoding but 3 bytes in UTF-8 encoding. These characters fall in the range U+0800 to U+FFFF, where UTF-16 uses a single 16-bit code unit, but UTF-8 uses three bytes for representation.
The Euro sign (€) with the Unicode code point U+20AC, discussed above, is one such character. Let’s break down the encoding:
- In UTF-16:
- The UTF-16 encoding for U+20AC is a single 16-bit code unit: 20AC (2 bytes).
- In UTF-8:
- The binary representation of U+20AC is 0010 0000 1010 1100.
- The UTF-8 encoding is 11100010:10000010:10101100 (3 bytes).
By contrast, a character below U+0800 such as “CYRILLIC SMALL LETTER SCHWA” (ә, U+04D9) takes 2 bytes in both encodings: 04D9 in UTF-16 and 11010011:10011001 (D3 99) in UTF-8.
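The sizes can be checked with a short sketch that encodes a few sample strings both ways. The helper names are made up for this example, and UTF_16BE is used so that no byte order mark is counted.

```java
import java.nio.charset.StandardCharsets;

class EncodedSizeSketch {
    // Hypothetical helpers: encoded size of a string in bytes.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length; // BE variant writes no BOM
    }

    public static void main(String[] args) {
        // 'a', U+04D9, U+20AC and U+1F600 (written as a surrogate pair in the source)
        String[] samples = {"a", "\u04D9", "\u20AC", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8 %d bytes, UTF-16 %d bytes%n",
                    s.codePointAt(0), utf8Bytes(s), utf16Bytes(s));
        }
    }
}
```

Only the Euro sign in this list shows the 3-versus-2 split. The others take 1 versus 2, 2 versus 2, and 4 versus 4 bytes.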
Pros and cons of UTF-8 and UTF-16?
Memory Usage:
- UTF-8: Variable-length encoding, requiring 1-4 bytes per character. Smaller for ASCII (1 byte) and most Western European text, but larger for CJK characters (3 bytes versus 2 in UTF-16). Emoji and other non-BMP characters take 4 bytes in both encodings.
- UTF-16: Also a variable-length encoding, using 2 bytes for every BMP character and 4 bytes (a surrogate pair) for characters outside the BMP. Memory usage is consistent for BMP-only text, but ASCII text takes twice the space it needs in UTF-8.
Processing:
- UTF-8: More complex processing due to variable-length encoding, requiring byte-level decoding and character length determination. This can make certain operations, such as indexing and substring extraction, more complex and potentially slower.
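- UTF-16: Simpler processing as long as the text stays within the BMP, where every character is one 16-bit code unit and can be accessed directly by index. Text that contains surrogate pairs needs the same kind of length checks as UTF-8.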
Storage:
- UTF-8: Generally smaller for text dominated by Latin characters or ASCII, but larger for CJK-heavy content.
- UTF-16: Can be smaller for CJK-heavy content, but larger for ASCII-heavy or mixed content because every ASCII character takes 2 bytes.
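Java strings are stored as UTF-16 code units, so they offer a quick way to see where direct indexing stops working. This is a sketch with a made-up helper name (codePointAtIndex) and an illustrative string containing U+1F600. It relies on the standard String methods offsetByCodePoints and codePointAt.

```java
class Utf16IndexingSketch {
    // Hypothetical helper: the code point at a given character (code point) index.
    static int codePointAtIndex(String s, int codePointIndex) {
        int unitIndex = s.offsetByCodePoints(0, codePointIndex); // walks over surrogate pairs
        return s.codePointAt(unitIndex);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println(s.length());                       // 4 code units
        System.out.println(s.codePointCount(0, s.length()));  // 3 characters
        System.out.println(Integer.toHexString(s.charAt(1))); // d83d, only half of the emoji
        System.out.println((char) codePointAtIndex(s, 2));    // b
    }
}
```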
Java code to display complexity?
In the following code we find the 5th character of the string cafénoir, which should be ‘n’ (zero-based character index 4). In bytes, c, a and f take 1 byte each (3 bytes in total), but é takes 2 bytes, so the 5th character starts at the 6th byte (byte index 5) and not the 5th. The code handles this by counting only bytes that are not continuation bytes (10xxxxxx), using the check if ((utf8Bytes[byteIndex] & 0xC0) != 0x80).
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class UTF8RandomAccessExample {
    public static void main(String[] args) {
        // UTF-8 encoded "cafénoir"; the charset is passed explicitly because
        // getBytes() without arguments uses the platform default charset
        byte[] utf8Bytes = "cafénoir".getBytes(StandardCharsets.UTF_8);
        // Random access to the 5th character (zero-based character index 4)
        int index = 4;
        int byteIndex = findByteIndexForCharacterIndex(utf8Bytes, index);
        byte[] characterBytes = extractCharacterBytes(utf8Bytes, byteIndex);
        // Decode the character back to a UTF-16 Java String
        String character = new String(characterBytes, StandardCharsets.UTF_8);
        System.out.println("Character at index " + index + ": " + character); // Output: n
    }

    // Returns the index of the first byte of the character at characterIndex
    private static int findByteIndexForCharacterIndex(byte[] utf8Bytes, int characterIndex) {
        int currentCharacterIndex = -1;
        for (int byteIndex = 0; byteIndex < utf8Bytes.length; byteIndex++) {
            // Every byte that is not a continuation byte (10xxxxxx) starts a new character
            if ((utf8Bytes[byteIndex] & 0xC0) != 0x80) {
                currentCharacterIndex++;
                if (currentCharacterIndex == characterIndex) {
                    return byteIndex;
                }
            }
        }
        throw new IndexOutOfBoundsException("No character at index " + characterIndex);
    }

    private static byte[] extractCharacterBytes(byte[] utf8Bytes, int byteIndex) {
        // Skip the lead byte, then find the end of the character (the next byte that
        // does not start with 10). Suppose the string was caféénoir: for the 5th
        // character é, end becomes byteIndex + 2 because é needs 2 bytes.
        int end = byteIndex + 1;
        while (end < utf8Bytes.length && (utf8Bytes[end] & 0xC0) == 0x80) {
            end++;
        }
        return Arrays.copyOfRange(utf8Bytes, byteIndex, end);
    }
}
getBytes (UTF-8) of café returns [99, 97, 102, -61, -87]. What does it signify?
Let’s break down each byte:
- 99 corresponds to the ASCII code of the character ‘c’.
- 97 corresponds to the ASCII code of the character ‘a’.
- 102 corresponds to the ASCII code of the character ‘f’.
- -61 and -87 together represent the UTF-8 encoding of the character ‘é’. Java bytes are signed, so values above 127 print as negative numbers: -61 is 0xC3 (195) and -87 is 0xA9 (169). In binary, -61 is 11000011, and -87 is 10101001. Combining them, you get the UTF-8 encoding 11000011:10101001, which represents the character ‘é’.
So, the entire byte array [99, 97, 102, -61, -87] is the UTF-8 representation of the string “café”.
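The signed values are easier to read once masked to their unsigned form. A minimal sketch, assuming a made-up helper named toHex:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SignedByteSketch {
    // Hypothetical helper: prints each byte as two unsigned hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF)); // & 0xFF maps -61 to 195 (0xC3)
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(bytes)); // [99, 97, 102, -61, -87]
        System.out.println(toHex(bytes));           // 63 61 66 C3 A9
    }
}
```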
In UTF-16 it comes as [-2, -1, 0, 99, 0, 97, 0, 102, 0, -23]
The array [-2, -1, 0, 99, 0, 97, 0, 102, 0, -23] in Java represents the UTF-16 encoding of the string “caf\u00E9”. In UTF-16, characters are represented by one or two 16-bit code units.
Let’s break down each pair of bytes:
- -2, -1 represents the Byte Order Mark (BOM) for UTF-16. It indicates the byte order (little-endian or big-endian). In this case, -2 corresponds to 0xFE and -1 corresponds to 0xFF, and the sequence FE FF indicates big-endian byte order (FF FE would indicate little-endian). The BOM is optional; when present, it is used to determine the byte order.
- 0, 99 represents the UTF-16 encoding of the character ‘c’.
- 0, 97 represents the UTF-16 encoding of the character ‘a’.
- 0, 102 represents the UTF-16 encoding of the character ‘f’.
- 0, -23 represents the UTF-16 encoding of the character ‘é’. In binary, -23 is 11101001. Combining 0 and -23 in big-endian order, you get the UTF-16 encoding 00000000:11101001 (0x00E9), which represents the character ‘é’.
So, the entire array [-2, -1, 0, 99, 0, 97, 0, 102, 0, -23] represents the UTF-16 big-endian encoding of the string “caf\u00E9”, where the first two bytes are the BOM, and the subsequent pairs of bytes represent the UTF-16 encoding of each character.
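The three UTF-16 charsets in java.nio.charset.StandardCharsets make the byte order visible. This is a small sketch, and the encode helper is only a hypothetical wrapper around String.getBytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf16ByteOrderSketch {
    // Hypothetical wrapper: encodes a string with the given charset.
    static byte[] encode(String s, Charset charset) {
        return s.getBytes(charset);
    }

    public static void main(String[] args) {
        String s = "caf\u00E9";
        // UTF_16 writes a big-endian BOM (FE FF) followed by big-endian code units
        System.out.println(Arrays.toString(encode(s, StandardCharsets.UTF_16)));
        // UTF_16BE: same byte order, no BOM
        System.out.println(Arrays.toString(encode(s, StandardCharsets.UTF_16BE)));
        // UTF_16LE: low byte first, no BOM
        System.out.println(Arrays.toString(encode(s, StandardCharsets.UTF_16LE)));
    }
}
```

The first line printed matches the array discussed above. The little-endian variant starts with 99, 0 instead of 0, 99.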
When we copy any character, what gets copied? Say I copy ‘a’ from this website: is it its binary format?
When you copy a character like ‘a’ from a website, what gets copied is the character as Unicode text, which the system stores in some binary encoding. The encoding used by the website determines how the browser decodes the page in the first place. Here’s a breakdown of the process:
1. Character Encoding:
- Websites typically use UTF-8 encoding, capable of representing a wide range of characters.
- In UTF-8, the character ‘a’ (lowercase Latin a) is encoded as a single byte: 01100001 (binary).
2. Copy Operation:
- When you hit “copy,” the browser performs the following:
- Takes the selected text, which it has already decoded from the page’s bytes into Unicode characters.
- Places this text on the clipboard, a temporary storage area managed by the operating system, in a format the platform defines. That format need not match the page’s encoding.
3. Paste Operation:
- When you paste, the receiving application reads the text from the clipboard and converts it to its own encoding:
- If the target supports Unicode (e.g., most modern text editors or word processors), it displays the character ‘a’.
- If the target uses an encoding that cannot represent the character, it might display a different character or a placeholder symbol. This does not happen for ‘a’, which is plain ASCII, but it can happen for characters such as ‘é’ or ‘€’.
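What happens when bytes are interpreted under a different encoding can be sketched by decoding the same UTF-8 bytes with two charsets. This only illustrates the mismatch and is not a model of how any particular clipboard works. The decode helper is a made-up wrapper, and ISO-8859-1 is chosen only as an example of a different encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongCharsetSketch {
    // Hypothetical wrapper: interprets raw bytes under the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] a = "a".getBytes(StandardCharsets.UTF_8);           // 61
        byte[] eAcute = "\u00E9".getBytes(StandardCharsets.UTF_8); // C3 A9

        // ASCII characters survive because both charsets agree on bytes 00..7F
        System.out.println(decode(a, StandardCharsets.ISO_8859_1).equals("a"));
        // Decoding with the right charset restores the single character U+00E9
        System.out.println(decode(eAcute, StandardCharsets.UTF_8).equals("\u00E9"));
        // Decoding with the wrong charset yields two characters, U+00C3 and U+00A9
        System.out.println(decode(eAcute, StandardCharsets.ISO_8859_1).equals("\u00C3\u00A9"));
    }
}
```

All three comparisons print true.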