CopperSpice API: Unicode

Explanation about the most widely accepted standard for encoding text.

Classes
class	QByteArray
	Stores a sequence of bytes

class	QChar32
	Implements a 32-bit Unicode code point

class	QLocale
	Formats data based on a given language or country

class	QRegularExpression< S >
	Provides pattern matching using regular expressions

class	QRegularExpressionMatch< S >
	Provides the results of matching a QRegularExpression for a given string

class	QString16
	Provides a UTF-16 string class

class	QString8
	Provides a UTF-8 string class

class	QStringList
	Provides a container which is optimized for strings

class	QStringParser
	Provides functionality for parsing a string

class	QStringView< S >
	String view class

class	QTextBoundaryFinder
	Provides a way of finding Unicode text boundaries in a string

class	QTextStream
	Interface for reading and writing text

Detailed Description

Unicode is an international encoding standard which supports the majority of the world's written languages. Each letter or symbol is assigned a unique numeric value. Nearly all software applications and operating systems provide some level of support for Unicode.

Introduction to Unicode

A character set is a collection of symbols. The set does not associate any values with these symbols, so it can be thought of as an unordered list of symbols. A good example of a character set is the Latin character set used by English and most European languages. The Latin character set should not be confused with Latin-1, which is a coded character set. A coded character set is the combination of a character set and a character map. A character map associates a value with each symbol. ASCII and ISO-8859-1 (Latin-1) are good examples of coded character sets.

A code point is the basic or atomic unit of strings. It is a numerical value defined by the Unicode standard and associated with a given symbol. When working with strings it is very important to think in terms of code points and not characters, since they are not the same. Unicode code points range from 0 to 10FFFF hex, so a 32-bit integer is used to store a code point value.

In the Latin-1 coded character set the value for the capital letter A is 41 hex. This is one code point.

In Unicode there is a symbol called "rightwards arrow with corner downwards" and it has a value of 21b4 hex. The symbol looks like ↴ and it is also exactly one code point.
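As a small illustration of the two code points above, the following C++ sketch stores them in the standard `char32_t` type. The constant names are invented for this example and are not part of CopperSpice.

```cpp
#include <climits>

// A code point is simply a number. char32_t is at least 32 bits wide,
// so it can hold any Unicode code point (the largest is U+10FFFF).
constexpr char32_t capital_a = U'A';        // U+0041
constexpr char32_t arrow     = U'\u21B4';   // rightwards arrow with corner downwards

// Each symbol is exactly one code point with one numeric value.
static_assert(capital_a == 0x41,   "capital letter A is 41 hex");
static_assert(arrow     == 0x21B4, "the arrow is 21b4 hex");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t holds 32 bits");
```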

The other definition that is needed is a storage unit, which the Unicode standard calls a code unit. This describes the unit of storage used to hold an encoded code point. In UTF-8 the storage unit is 8 bits. In UTF-16 the storage unit is 16 bits. In UTF-32 the storage unit is 32 bits.

In UTF-8 the Latin-1 capital letter A is represented by one storage unit. In UTF-16 the Latin-1 capital letter A is represented by one 16-bit storage unit, which occupies two bytes, since this is the smallest size for any code point in UTF-16. In both character encodings the capital letter A is still exactly one code point.

In UTF-8 the "rightwards arrow with corner downwards" is represented by three storage units. In UTF-16 the "rightwards arrow with corner downwards" is represented by one storage unit, which occupies two bytes. In both character encodings this symbol is still exactly one code point.
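The counts for these two symbols can be checked with a short sketch. The helper functions below are hypothetical and written only for this example. They count storage units as defined earlier, so one UTF-16 storage unit is 16 bits, which is two bytes. The input is assumed to be a valid code point.

```cpp
#include <cstddef>

// Number of 8-bit storage units UTF-8 needs for one code point.
constexpr std::size_t utf8_units(char32_t cp) {
    return cp < 0x80    ? 1
         : cp < 0x800   ? 2
         : cp < 0x10000 ? 3
         :                4;
}

// Number of 16-bit storage units UTF-16 needs for one code point.
// Code points above U+FFFF are stored as a pair of surrogates.
constexpr std::size_t utf16_units(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}

// Capital letter A: one UTF-8 unit, one UTF-16 unit (two bytes).
static_assert(utf8_units(0x41) == 1 && utf16_units(0x41) == 1, "A");

// Rightwards arrow with corner downwards: three UTF-8 units,
// one UTF-16 unit (two bytes).
static_assert(utf8_units(0x21B4) == 3 && utf16_units(0x21B4) == 1, "arrow");

// A code point outside the first 65536 values, here U+1F600:
// four UTF-8 units, two UTF-16 units (four bytes).
static_assert(utf8_units(0x1F600) == 4 && utf16_units(0x1F600) == 2, "U+1F600");
```

In every case the symbol remains a single code point. Only the number of storage units changes with the encoding.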

Both UTF-8 and UTF-16 are variable length encodings. The calculations required to encode a code point in UTF-8 are simpler than in UTF-16. In addition, the concept of big-endian versus little-endian does not apply to UTF-8 because its storage unit is a single byte.
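To make the byte order point concrete, here is a minimal sketch of both encodings in standard C++. The function names are assumptions made for this example and do not come from CopperSpice, which performs this work internally. The UTF-8 result is already a byte sequence, while the UTF-16 result must be given a byte order before it can be written out.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// True for values which may be encoded: at most U+10FFFF and not a surrogate.
bool is_valid_code_point(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Encode one code point as UTF-8. The output is a sequence of bytes,
// so there is no byte order to choose.
std::string encode_utf8(char32_t cp) {
    assert(is_valid_code_point(cp));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point as 16-bit UTF-16 storage units.
// Code points above U+FFFF become a high and a low surrogate.
std::vector<std::uint16_t> encode_utf16(char32_t cp) {
    assert(is_valid_code_point(cp));
    if (cp < 0x10000) {
        return { static_cast<std::uint16_t>(cp) };
    }
    const std::uint32_t v = cp - 0x10000;
    return { static_cast<std::uint16_t>(0xD800 | (v >> 10)),
             static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF)) };
}

// Writing UTF-16 to a byte stream forces a choice of byte order.
std::vector<std::uint8_t> utf16_to_bytes(const std::vector<std::uint16_t>& units,
                                         bool big_endian) {
    std::vector<std::uint8_t> bytes;
    for (std::uint16_t u : units) {
        const std::uint8_t hi = static_cast<std::uint8_t>(u >> 8);
        const std::uint8_t lo = static_cast<std::uint8_t>(u & 0xFF);
        if (big_endian) {
            bytes.push_back(hi);
            bytes.push_back(lo);
        } else {
            bytes.push_back(lo);
            bytes.push_back(hi);
        }
    }
    return bytes;
}
```

For the arrow at 21b4 hex, `encode_utf8` yields the bytes E2 86 B4 on every platform. The single UTF-16 unit is written as 21 B4 in big-endian order and as B4 21 in little-endian order, which is why UTF-16 data needs a byte order mark or an agreed convention.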

The choice by many programming languages to use UTF-16 was made between 1992 and 1995. In more recent years several operating systems as well as the Unicode Consortium have established this was a mistake and UTF-8 is a far better choice. Developers have realized that the majority of the Romance languages use a Latin alphabet and most of the symbols can be represented using 8 bits. It does not make sense to use UTF-16 when most of the code points in such text can be represented in one UTF-8 storage unit.

All text in your application should be stored using QString8 as it natively supports UTF-8 which is the preferred Unicode encoding.

String Classes

The classes listed on this page are useful when working with strings or rendering text.

It is not necessary to know the full extent of Unicode in order to develop a software product which complies with Unicode. However, if your application deals with strings or text at all, a solid understanding of the fundamentals of Unicode is strongly advisable.

For additional information about processing text refer to the following:

Rich Text Processing

XML Processing

Internationalization

Codecs and Streams

All file I/O should be done using QTextStream.

Use QKeyEvent::text() for keyboard input in custom widgets. To translate from other encodings to QString8 you can use QTextCodec.

Unicode in Depth

The following is a list of documents which cover Unicode in much greater detail.