CsString  1.3.2
String Terminology

The following terminology is helpful for understanding the complexity of encoded characters and how they are stored and retrieved.

A character set is a collection of symbols. A character set does not associate any values with these symbols; it is an unordered list of the symbols. As an example, the Latin character set is used in English and most European languages, whereas the Greek character set is used only by the Greek language.

The values associated with the character set are traditionally called the character encoding, which is very confusing terminology. A better term for this is the character map.

The combination of a character set and a character map is called a coded character set.

A code point or code position is a character encoding term which refers to the numerical values defined in the Unicode standard. A code point is stored as a 32-bit integer data type where the lower 21 bits represent a valid code point and the upper 11 bits are 0.
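As a minimal sketch of the range described above, the following check treats a 32-bit value as a code point only when it lies within the Unicode code space. The names `kMaxCodePoint` and `is_code_point` are hypothetical and are not part of CsString. Note that 21 bits could hold values up to 0x1FFFFF, but the Unicode code space ends at U+10FFFF.

```cpp
// Highest code point in the Unicode code space.
constexpr char32_t kMaxCodePoint = 0x10FFFF;

// True when the value lies in the Unicode code space. Any such value has
// the upper 11 bits of the 32-bit integer set to zero.
// Surrogate values are not rejected here (an assumption of this sketch).
constexpr bool is_code_point(char32_t value)
{
   return value <= kMaxCodePoint;
}

static_assert(is_code_point(U'A'), "U+0041 is a code point");
static_assert(! is_code_point(0x110000), "past the end of the code space");
```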

A code unit or storage unit describes the unit of storage for an encoded code point. In UTF-8 the code unit is 8 bits and in UTF-16 the code unit is 16 bits.

As an example, the letter A has a code point value of U+0041. In UTF-8 this is represented by one byte. In UTF-16 this is represented by two bytes. The letter A is one code point and one storage unit in either character encoding.

The rightwards arrow with corner downwards character has a code point value of U+21B4. In UTF-8 this is represented by three bytes, which is three storage units. In UTF-16 the same code point is represented by two bytes, which is one storage unit. The character is always exactly one code point.
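The two examples above can be reproduced with a small UTF-8 encoder. This is only a sketch: `to_utf8` and `storage_unit_example` are hypothetical helpers, not CsString functions, and the encoder assumes its argument is a valid code point.

```cpp
#include <cassert>
#include <string>

// Encode one code point as UTF-8. Each char in the result is one 8-bit
// storage unit. No validation is performed (an assumption of this sketch).
std::string to_utf8(char32_t cp)
{
   std::string out;

   if (cp < 0x80) {
      out += static_cast<char>(cp);
   } else if (cp < 0x800) {
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
   } else if (cp < 0x10000) {
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
   } else {
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
   }

   return out;
}

void storage_unit_example()
{
   // U+0041: one storage unit in UTF-8 and one storage unit in UTF-16
   assert(to_utf8(U'A').size() == 1);
   assert(std::u16string(u"A").size() == 1);

   // U+21B4: three storage units in UTF-8, one storage unit in UTF-16
   assert(to_utf8(0x21B4).size() == 3);
   assert(std::u16string(u"\u21B4").size() == 1);
}
```

In both cases the string holds exactly one code point; only the number of storage units differs.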

The Basic Multilingual Plane (BMP) is the first 64k code points in Unicode. It contains characters for almost all modern languages and a large number of symbols. This is the set of characters which fit into 2 bytes in UTF-16.

ASCII

The American Standard Code for Information Interchange is a 7-bit coded character set and was finalized in 1968. This set of 128 characters from 00 to 7F matches the corresponding Unicode code points. The name ASCII is often incorrectly used to refer to various 8-bit coded character sets which just happen to include the ASCII characters in the first 128 code points.
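Because ASCII is a 7-bit set, a byte belongs to it only when the eighth bit is clear. The following sketch tests a byte string that way; `is_us_ascii` and `ascii_example` are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <string>

// True when every byte falls in the 7-bit range 00 to 7F.
bool is_us_ascii(const std::string &text)
{
   for (unsigned char ch : text) {
      if (ch > 0x7F) {
         return false;
      }
   }

   return true;
}

void ascii_example()
{
   assert(is_us_ascii("plain text"));
   assert(! is_us_ascii("caf\xE9"));   // 0xE9 needs the eighth bit
}
```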

The name ASCII was deprecated by IANA in 1992. The name US-ASCII is now preferred to avoid confusion with the 8-bit character sets.

Latin-1

The "Latin Alphabet Number 1", also known as ISO-8859-1, is an 8-bit coded character set. The first edition of this standard was published in 1987. This set consists of 191 characters from the Latin script which were later used in the first 256 code points in Unicode.
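Since the Latin-1 values were carried over into the first 256 Unicode code points, converting Latin-1 text to UTF-8 needs no lookup table. The sketch below shows one way to do this. The names `latin1_to_utf8` and `latin1_example` are hypothetical, and the bytes 0x80 to 0x9F, which hold no Latin-1 characters, are assumed to map to the code points with the same values.

```cpp
#include <cassert>
#include <string>

// Convert ISO-8859-1 bytes to UTF-8. Each byte value is used directly as
// the Unicode code point.
std::string latin1_to_utf8(const std::string &input)
{
   std::string out;

   for (unsigned char byte : input) {
      if (byte < 0x80) {
         // ASCII range, stored unchanged as one byte
         out += static_cast<char>(byte);
      } else {
         // U+0080 to U+00FF need two bytes, the lead byte is 0xC2 or 0xC3
         out += static_cast<char>(0xC0 | (byte >> 6));
         out += static_cast<char>(0x80 | (byte & 0x3F));
      }
   }

   return out;
}

void latin1_example()
{
   // 0xE9 is e with acute in Latin-1, UTF-8 stores it as C3 A9
   assert(latin1_to_utf8("caf\xE9") == "caf\xC3\xA9");
}
```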

The Latin-1 character set is a superset of the ASCII standard. This character set is used in the US, Western Europe, and much of Africa. There are many other ISO Latin character sets which support Central European, Greek, Hebrew and other languages.

Latin-9

The "Latin Alphabet Number 9", also known as ISO-8859-15, is an 8-bit coded character set. The first edition of this standard was published in 1999. It is very similar to Latin-1 but replaces some less common symbols with the Euro sign and other symbols used in Finnish and Estonian.
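The difference between the two sets is small enough to write out. The sketch below maps a Latin-9 byte to its Unicode code point. The eight replaced positions are taken from the ISO-8859-15 mapping rather than from this document, and `latin9_to_code_point` is a hypothetical name.

```cpp
// Map one ISO-8859-15 byte to its Unicode code point. Eight positions
// differ from ISO-8859-1, every other byte maps to its own value.
constexpr char32_t latin9_to_code_point(unsigned char byte)
{
   switch (byte) {
      case 0xA4: return 0x20AC;   // euro sign
      case 0xA6: return 0x0160;   // capital S with caron
      case 0xA8: return 0x0161;   // small s with caron
      case 0xB4: return 0x017D;   // capital Z with caron
      case 0xB8: return 0x017E;   // small z with caron
      case 0xBC: return 0x0152;   // capital ligature OE
      case 0xBD: return 0x0153;   // small ligature oe
      case 0xBE: return 0x0178;   // capital Y with diaeresis
      default:   return byte;     // same value as Latin-1
   }
}

static_assert(latin9_to_code_point(0xA4) == 0x20AC, "euro sign replaces the currency sign");
static_assert(latin9_to_code_point(0xE9) == 0x00E9, "unchanged from Latin-1");
```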

Unicode

Unicode code points are by definition 32 bits. When working with Unicode code points there is no choice; everything is a 32-bit character. The Unicode Consortium realized the majority of the Romance languages use the Latin alphabet and most of these symbols can be represented using 8 bits. The remainder of the symbols need 16 bits or 32 bits. Most likely, the Unicode team decided it would not make sense to expect everyone to use a 32-bit character encoding when most text can be represented in 8 bits or 16 bits.

Companies like Microsoft may have selected a text encoding without really thinking things through and they elected to adopt UTF-16 as the native encoding for Unicode on Windows. Languages like Java and the Qt framework did the same thing; a 16-bit encoding seemed attractive and the correct choice. Developers of languages, operating systems, and applications later learned from the struggles of existing string implementations and realized UTF-8 was the better option.

The reason UTF-16 makes little sense is that it is wasteful for the majority of people who only need ASCII and it is also a pain to use, especially if you want to generate data which is portable and can be used cross-platform. UTF-16 is a variable length encoding because not every code point can be represented using 16 bits. The majority of usable code points will fit into 16 bits, but it is misleading to say that Unicode can be represented in a 16-bit format.
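The variable length nature of UTF-16 shows up as soon as a code point does not fit in 16 bits. Such a code point is split across two storage units called a surrogate pair. The following is a sketch of that step; `to_utf16` and `surrogate_example` are hypothetical helpers and the input is assumed to be a valid code point.

```cpp
#include <cassert>
#include <string>

// Encode one code point as UTF-16. No validation is performed
// (an assumption of this sketch).
std::u16string to_utf16(char32_t cp)
{
   std::u16string out;

   if (cp < 0x10000) {
      // fits in a single 16-bit storage unit
      out += static_cast<char16_t>(cp);
   } else {
      // 20 bits remain after subtracting 0x10000, split 10 and 10
      const char32_t offset = cp - 0x10000;

      out += static_cast<char16_t>(0xD800 + (offset >> 10));     // high surrogate
      out += static_cast<char16_t>(0xDC00 + (offset & 0x3FF));   // low surrogate
   }

   return out;
}

void surrogate_example()
{
   // U+21B4 is in the 16-bit range, one storage unit
   assert(to_utf16(0x21B4).size() == 1);

   // U+1F600 is above the 16-bit range, stored as D83D DE00
   const std::u16string pair = to_utf16(0x1F600);
   assert(pair.size() == 2 && pair[0] == 0xD83D && pair[1] == 0xDE00);
}
```

Code which assumes one 16-bit unit per character will mishandle any text containing such a pair.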

Of course, UTF-8 is also a variable length encoding. It is a much better encoding since there are numerous code points which only require one byte instead of the two bytes required in UTF-16. Code points from U+0080 through U+07FF require two bytes in either encoding, and supplementary code points require four bytes in either encoding. Any variable length encoding requires deciphering how many bytes comprise a single code point. This logic is simpler for UTF-8, and since the storage units are individual bytes there is no concept of big-endian versus little-endian.
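For UTF-8, the number of bytes in a sequence can be read from the high bits of the first byte alone. A minimal sketch follows, using the hypothetical name `utf8_sequence_length`. It inspects only the bit pattern and does not reject lead bytes which can never appear in well-formed UTF-8, such as 0xC0 or 0xF5.

```cpp
// Number of bytes in the UTF-8 sequence which begins with this byte.
// Returns 0 for a continuation byte (10xxxxxx) and for 0xF8 and above.
constexpr int utf8_sequence_length(unsigned char lead)
{
   return (lead < 0x80)           ? 1     // 0xxxxxxx
        : ((lead & 0xE0) == 0xC0) ? 2     // 110xxxxx
        : ((lead & 0xF0) == 0xE0) ? 3     // 1110xxxx
        : ((lead & 0xF8) == 0xF0) ? 4     // 11110xxx
        : 0;
}

static_assert(utf8_sequence_length(0x41) == 1, "letter A");
static_assert(utf8_sequence_length(0xE2) == 3, "first byte of U+21B4");
static_assert(utf8_sequence_length(0x86) == 0, "continuation byte");
```

Since every storage unit is a single byte, the same bytes are produced and consumed on any platform regardless of byte order.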

Overall, UTF-16 is the worst encoding choice since it is both variable length and too wide. It creates a lot of confusion and is rarely implemented correctly.

Character Encodings

UTF-8 is the most widely used encoding for text in web pages and databases. There is a wide push among many communities to migrate all data interchange to use UTF-8.

There are over 100 encodings defined by other groups; however, Unicode has narrowed its support to the following encodings: UTF-8, UTF-16, and UTF-32.


| Encoding | Description | # of Bytes | Notes |
|---|---|---|---|
| UCS-2 (1991) | Supports only characters in the Basic Multilingual Plane; obsolete in Unicode 8.0, Aug 2015 | Uses 16-bit storage | Superseded by UTF-16 |
| UCS-4 (1993) | Defined in the ISO-10646 standard, not Unicode; according to Unicode, identical to UTF-32 | Uses 32-bit storage | Superseded by UTF-32 |
| UTF-7 | Developed for email, never widely used, problematic and inefficient | 1 byte for most ASCII characters; up to 8 bytes for other code points | |
| UTF-8 (1996) | Most widely used encoding for text in web pages and databases | 1 byte for ASCII characters; 2 bytes for characters in various alphabetic blocks; 3 bytes for the rest of the Basic Multilingual Plane; 4 bytes for supplementary characters | Is backward compatible with ASCII |
| UTF-9 & UTF-18 | Encodings were proposed in an RFC released on April 1 2005; some portions were implemented in PDP-10 assembly | | Intended as an April Fools joke |
| UTF-16 (1996) | Used in many languages like Java, Qt, .NET environments, and many operating systems | 2 bytes for any character in the Basic Multilingual Plane; 4 bytes for supplementary characters | Not backward compatible with ASCII |
| UTF-32 (2001) | Used mostly when working with glyphs rather than strings of characters | 4 bytes for all characters | Not backward compatible with ASCII |
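
The byte counts listed for UTF-8, UTF-16, and UTF-32 can be expressed as three small functions. This is a sketch for comparison only. The function names are hypothetical and each function assumes it is given a valid code point.

```cpp
// Bytes needed to store one code point in each Unicode encoding.
constexpr int utf8_size(char32_t cp)
{
   return (cp < 0x80)    ? 1     // ASCII
        : (cp < 0x800)   ? 2     // various alphabetic blocks
        : (cp < 0x10000) ? 3     // rest of the Basic Multilingual Plane
        : 4;                     // supplementary characters
}

constexpr int utf16_size(char32_t cp)
{
   return (cp < 0x10000) ? 2 : 4;   // one storage unit or a surrogate pair
}

constexpr int utf32_size(char32_t)
{
   return 4;                        // always one 32-bit storage unit
}

// U+0041 letter A
static_assert(utf8_size(0x41) == 1 && utf16_size(0x41) == 2 && utf32_size(0x41) == 4, "");

// U+21B4 rightwards arrow with corner downwards
static_assert(utf8_size(0x21B4) == 3 && utf16_size(0x21B4) == 2 && utf32_size(0x21B4) == 4, "");

// U+1F600, a supplementary character
static_assert(utf8_size(0x1F600) == 4 && utf16_size(0x1F600) == 4 && utf32_size(0x1F600) == 4, "");
```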