CsString 1.4.0
Why CsString was Created

Prior to the days of networking, sharing data, and the Internet, most software treated a byte as a character and every character was 8 bits. The majority of data was confined to a single computer or, at the least, a single spoken language. The ASCII character set combined with a language specific code page worked reasonably well for most applications. Every language simply defined its own code page, a table for the characters beyond the first 128.

With the advent of the Internet, data started moving between computers, countries, and languages. Anyone working with strings in their applications found internationalization nearly impossible to support. The sheer number of code pages was out of control and difficult to maintain.

Strings are fundamental to almost all programming tasks, applications, databases, and operating systems. The following are some of the challenges when dealing with strings in a multilingual environment.

  • Incorrect use of strings is the leading cause of security vulnerabilities.
  • Most programs are poorly tested for internationalization support.
  • Some languages support multiple encoding formats, others force users to use one specific encoding.
  • In C++ the native string class does not support any encoding (see the sketch after this list).

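As an illustration of the last point, here is a minimal sketch in plain C++ (standard library only, not CsString code) showing that std::string simply stores bytes and has no notion of characters, so size() and indexing operate on raw bytes even when the contents happen to be UTF-8 text.

    #include <iostream>
    #include <string>

    int main()
    {
        // "grüß" encoded as UTF-8: 'g' and 'r' are one byte each,
        // 'ü' and 'ß' are two bytes each, six bytes in total
        std::string text = "gr\xC3\xBC\xC3\x9F";

        // size() reports bytes, not characters
        std::cout << "bytes:   " << text.size() << "\n";       // prints 6

        // indexing returns a single byte, here only half of the 'ü'
        unsigned char piece = text[2];                          // 0xC3
        std::cout << "text[2]: " << std::hex << static_cast<unsigned>(piece) << "\n";
    }
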
The development and adoption of Unicode began to solve many of these compatibility issues. The Unicode standard was first released in 1991 by the Unicode Consortium. Initially Unicode was entirely a 16-bit character encoding. UCS-2 was the first standard, and every character occupied exactly 16 bits. Many languages and operating systems took the approach of implementing UCS-2 in their 16-bit string classes to support Unicode.

A few years later UCS-4, a 32-bit encoding, was released once it became clear that 32 bits are required to fully represent all of the characters and symbols which appear in the union of all languages.

As time progressed, members of the Unicode Consortium realized that although a 32-bit representation is needed to give every character a unique value, storing each character in 16 or 32 bits is inefficient. To resolve this dilemma, a variable-length 8-bit encoding was chosen as the best practice, and there was a belief within the Unicode team that UTF-8 would prevail. Unfortunately, many systems and languages were already entrenched in the 16-bit storage format and companies did not want to redesign their string classes.

When UTF-16 replaced UCS-2 in 1996, problems started to resurface. UTF-16 encodes some characters as a single 16-bit value and others as a pair of 16-bit values (a surrogate pair) using 32 bits of storage. The problem is that a 16-bit value can be a whole character or half of a character, which causes issues for 16-bit string classes. Twenty years later there are still major challenges and bugs in string operations that use 16-bit string classes.
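
To make the half-character problem concrete, here is a minimal sketch in plain C++ (standard library only, not CsString code) showing how a 16-bit string class exposes the two halves of a surrogate pair as separate elements.

    #include <iostream>
    #include <string>

    int main()
    {
        // U+1F600 lies outside the Basic Multilingual Plane, so UTF-16
        // stores it as a surrogate pair: 0xD83D followed by 0xDE00
        std::u16string smiley = u"\U0001F600";

        // one character, yet the 16-bit string reports two elements
        std::cout << "code units: " << smiley.size() << "\n";    // prints 2

        // indexing or slicing can split the character in half
        char16_t half = smiley[0];                                // 0xD83D, meaningless on its own
        std::cout << std::hex << static_cast<unsigned>(half) << "\n";
    }

Any code that indexes, slices, or measures such a string element by element must account for surrogate pairs, which is exactly the class of bugs described above.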

Developers have asked for a better approach, and we believe CsString is the answer to this dilemma in C++.