What is the difference between charset and encoding?

Charset

As the name suggests, a charset is a set of characters. A ‘character set’ is just what it says: a properly specified list of distinct characters. A coded character set (ASCII, EBCDIC, Unicode) goes one step further and assigns a number to each character, independent of storage considerations.
Every language has its own characters, and the collection of those characters forms the “character set” of that language. When a character is added to a coded character set, it is assigned a unique number called a code point. In a computer, these code points are represented by one or more bytes.
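
As a small illustration of code points, here is a minimal Python sketch. It relies only on the built-in ord() and chr() functions; the variable names are chosen just for this example.

```python
# A code point is just a number assigned to a character.
# ord() returns that number; chr() goes the other way.
for ch in ("ü", "क"):
    code_point = ord(ch)
    # Show the code point in decimal and in the usual U+XXXX hex notation.
    print(ch, code_point, f"U+{code_point:04X}")

latin_u = ord("ü")       # 252  (U+00FC)
devanagari_ka = ord("क")  # 2325 (U+0915)

# The mapping is reversible: the number identifies exactly one character.
print(chr(devanagari_ka) == "क")  # True
```

Note that nothing here says how many bytes the character takes in storage. That is decided by the encoding.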

Examples of character sets: ASCII (128 characters, including the unaccented English letters, digits and basic punctuation), ISO/IEC 646, Unicode (aims to cover the characters of all the world's writing systems).

Character Encoding

Encoding is the mechanism that maps code points to bytes, so that a character can be read and written uniformly across different systems that use the same encoding scheme.
Examples of encodings: ASCII, and the Unicode encoding schemes UTF-8, UTF-16 and UTF-32.

Consider this: the character ‘क’ in the Devanagari script has a decimal code point of 2325, which is represented by two bytes (09 15) when using the UTF-16 encoding. In the “ISO-8859-1” encoding scheme, “ü” (a character in the Latin character set) is represented by the hexadecimal value “FC”, while in “UTF-8” it is represented as “C3 BC” and in big-endian UTF-16 as “00 FC” (or “FE FF 00 FC” when the byte order mark “FE FF” is written in front of it).
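
The byte values quoted for “ü” can be checked with a short Python sketch. The helper hex_bytes is a hypothetical name introduced only for this example; str.encode and codecs.BOM_UTF16_BE are part of the standard library. The sketch assumes big-endian byte order for UTF-16, and it adds the byte order mark by hand to show where the leading “FE FF” comes from.

```python
import codecs

def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs (illustrative helper)."""
    return " ".join(f"{b:02X}" for b in data)

ch = "ü"

latin1 = ch.encode("iso-8859-1")   # one byte
utf8 = ch.encode("utf-8")          # two bytes
utf16_be = ch.encode("utf-16-be")  # two bytes, big-endian, no byte order mark
# FE FF is the big-endian byte order mark, not part of the character itself.
utf16_with_bom = codecs.BOM_UTF16_BE + utf16_be

print(hex_bytes(latin1))          # FC
print(hex_bytes(utf8))            # C3 BC
print(hex_bytes(utf16_be))        # 00 FC
print(hex_bytes(utf16_with_bom))  # FE FF 00 FC
```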

Different encoding schemes may use the same code point to represent different characters. For example, in “ISO-8859-1” (also called Latin-1) the decimal code point value for the letter ‘é’ is 233. In ISO 8859-5, however, the same value represents the Cyrillic character ‘щ’.
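
This is easy to reproduce. The following sketch decodes the same single byte (decimal 233) under both encodings, using only the standard bytes.decode method; the variable names are illustrative.

```python
# One and the same byte value, interpreted under two different encodings.
raw = bytes([233])  # hex E9

as_latin1 = raw.decode("iso-8859-1")    # Latin-1 reading of the byte
as_cyrillic = raw.decode("iso-8859-5")  # ISO 8859-5 (Cyrillic) reading

print(as_latin1)    # é
print(as_cyrillic)  # щ

# In Unicode the two characters have different code points.
print(f"U+{ord(as_latin1):04X}", f"U+{ord(as_cyrillic):04X}")  # U+00E9 U+0449
```

This is why text decoded with the wrong encoding shows up as the wrong characters: the bytes are unchanged, only their interpretation differs.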

On the other hand, a single code point in the Unicode character set can be mapped to different byte sequences, depending on which encoding is used for the document. The Devanagari character क, with code point 2325 (which is 915 in hexadecimal notation), is represented by two bytes when using the UTF-16 encoding (09 15), three bytes with UTF-8 (E0 A4 95), or four bytes with UTF-32 (00 00 09 15). The UTF-16 and UTF-32 bytes are shown here in big-endian order.
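
A minimal sketch of the same comparison in Python, assuming big-endian byte order for UTF-16 and UTF-32 (the "-be" codec names) so that no byte order mark is added. The dictionary name encoded is chosen only for this example.

```python
ch = "क"  # code point 2325, i.e. U+0915

# Encode the same code point with three Unicode encoding schemes.
encoded = {name: ch.encode(name) for name in ("utf-16-be", "utf-8", "utf-32-be")}

for name, data in encoded.items():
    hex_view = " ".join(f"{b:02X}" for b in data)
    print(f"{name:10} {len(data)} bytes: {hex_view}")

# Every byte sequence decodes back to the same single character.
round_trip = {name: data.decode(name) for name, data in encoded.items()}
print(all(text == ch for text in round_trip.values()))  # True
```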

Uday Ogra

Connect with me at http://facebook.com/tendulkarogra and let's have some healthy discussion :)
