Unicode Support and the IUnicode Class

Overview of Unicode Support

Many Open Class classes use the Unicode character encoding standard to represent text data internally. Unicode, a fixed-width, 16-bit character encoding system, contains codes for every character in every major world script, along with a wide set of symbols, punctuation, and control characters. Because the Unicode system can store and access every character, regardless of its script or natural language, it lets you manipulate text more easily than in environments that require multiple code pages to support different character sets.

The Unicode support classes let you query the properties associated with individual Unicode character values. These properties, provided implicitly by the Unicode character encoding standard, include:

information about the character's script (for example, Latin or Cyrillic)
information about the kind of character (for example, symbol or control character)
semantic information, such as whether a character is a digit or whether it is uppercase, lowercase, or uncased

The IUnicode Class

The primary class in the Unicode support classes is IUnicode, which lets you determine a character's script and character properties. The Unicode support classes also provide a mechanism for referencing specific Unicode character values by name instead of by code point.

IUnicode provides a set of static functions that check a Unicode character, represented by the data type UniChar, for a specific property--for example, whether the character is uppercase, a digit, or one of the space characters. These functions let you test for a property without knowing every character that has it. For example, you can test for a space character with the IUnicode::IsASpace function without knowing the full set of Unicode characters used to represent a space.
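The value of a property-testing function is easier to see with a concrete case. The following is a minimal sketch, not the Open Class implementation: UniCharSketch and UnicodeSketch::isASpace are hypothetical stand-ins for UniChar and IUnicode::IsASpace, and the sketch assumes that "space character" means the 16-bit members of the Unicode space separator category.

```cpp
// Hypothetical stand-in for a 16-bit Unicode character type such as UniChar.
using UniCharSketch = char16_t;

// Illustrative only: this is NOT how IUnicode::IsASpace is implemented.
// Assumption: a "space" is any 16-bit character in the Unicode
// space separator (Zs) category.
struct UnicodeSketch {
    static constexpr bool isASpace(UniCharSketch c) {
        return c == 0x0020                   // SPACE
            || c == 0x00A0                   // NO-BREAK SPACE
            || c == 0x1680                   // OGHAM SPACE MARK
            || (c >= 0x2000 && c <= 0x200A)  // EN QUAD through HAIR SPACE
            || c == 0x202F                   // NARROW NO-BREAK SPACE
            || c == 0x205F                   // MEDIUM MATHEMATICAL SPACE
            || c == 0x3000;                  // IDEOGRAPHIC SPACE
    }
};
```

A caller that compared only against U+0020 would miss every other character in this list, which is the case a property function is meant to cover.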

The class library also provides a set of classes that contain enumerated names for each Unicode character value. These classes correspond to groups of characters based on script or function: ULatin, UGreek, UDingbats, UMathematicalOperators, and so on. Use the names enumerated in these classes to reference specific Unicode character values.
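As a rough illustration of referencing characters by name, the sketch below groups named constants by script. The class names ULatinSketch and UGreekSketch and every enumerator name are invented for this example; they are not the identifiers that ULatin or UGreek define, so consult the class reference for the real names.

```cpp
// Hypothetical name groups, one per script. The enumerator names are
// assumptions made for illustration, not the Open Class identifiers.
struct ULatinSketch {
    enum : char16_t {
        kCapitalLetterA = 0x0041,
        kSmallLetterA   = 0x0061
    };
};

struct UGreekSketch {
    enum : char16_t {
        kCapitalLetterAlpha = 0x0391,
        kSmallLetterAlpha   = 0x03B1
    };
};

// Comparing against a name documents intent better than a bare 0x0391.
constexpr bool isGreekCapitalAlpha(char16_t c) {
    return c == UGreekSketch::kCapitalLetterAlpha;
}
```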

Character Support

The Unicode character set provides full character coverage for the major scripts listed below, as well as for punctuation, symbols, and control characters. The character set for each script is independent--even if a character appears in multiple scripts, it has a separate code within each script. For example, the character A has one code for the Roman alphabet, another code for the Greek alphabet, and yet another code for the Cyrillic alphabet. However, because more than one language may use a given alphabet, the character A is represented by the same code for English, French, and, in fact, all languages that use the Roman alphabet.
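To make the per-script codes concrete: the look-alike capital letters are U+0041 in the Roman alphabet, U+0391 in Greek, and U+0410 in Cyrillic. The sketch below classifies a character by code range. ScriptSketch and scriptOf are hypothetical names, and the ranges are a coarse approximation (basic unaccented Roman letters plus the Greek and Cyrillic blocks), not the lookup that IUnicode performs.

```cpp
// Hypothetical script tags for this example only.
enum class ScriptSketch { Roman, Greek, Cyrillic, Other };

constexpr char16_t kRomanCapitalA    = 0x0041;  // A
constexpr char16_t kGreekCapitalAlpha = 0x0391; // looks like A
constexpr char16_t kCyrillicCapitalA = 0x0410;  // looks like A

// Coarse classification by code range. Assumption: only unaccented
// Roman letters and the main Greek and Cyrillic blocks are recognized.
constexpr ScriptSketch scriptOf(char16_t c) {
    return ((c >= 0x0041 && c <= 0x005A) || (c >= 0x0061 && c <= 0x007A))
               ? ScriptSketch::Roman
         : (c >= 0x0370 && c <= 0x03FF) ? ScriptSketch::Greek
         : (c >= 0x0400 && c <= 0x04FF) ? ScriptSketch::Cyrillic
                                        : ScriptSketch::Other;
}
```

Because the three letters have different codes, a comparison such as kRomanCapitalA == kCyrillicCapitalA is false even though the glyphs look alike.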

Supported Scripts

Arabic	Georgian	Hangul	Malayalam	Thai
Armenian	Greek	Hebrew	Oriya	Zhuyinfuhao
Bengali	Gujarati	Kana	Roman
Cyrillic	Gurmukhi	Kannada	Tamil
Devanagari	Han	Lao	Telugu

Reserved Areas

The Unicode standard sets aside a range of characters--from U+E000 to U+F8FF--for private use. You can use this range for:

defining special characters or sets of characters not included in the Unicode set
assigning specific semantics to a character

By convention, this area is divided into an end-user zone, which begins at U+E000 and ascends toward higher numbers, and a corporate-use zone, which begins at U+F8FF and descends toward lower numbers. The purpose of this convention is to minimize conflicting assignments within the private use area.
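The two-ended allocation convention can be sketched as follows. All names here (isPrivateUse, endUserCode, corporateCode) are hypothetical helpers written for this example and are not part of the Unicode support classes. The offsets are not range-checked, so the caller must keep the two zones from meeting.

```cpp
constexpr char16_t kPrivateUseFirst = 0xE000;  // start of the end-user zone
constexpr char16_t kPrivateUseLast  = 0xF8FF;  // start of the corporate-use zone

// True when c lies anywhere in the private use area.
constexpr bool isPrivateUse(char16_t c) {
    return c >= kPrivateUseFirst && c <= kPrivateUseLast;
}

// End-user assignments ascend from U+E000.
constexpr char16_t endUserCode(unsigned offset) {
    return static_cast<char16_t>(kPrivateUseFirst + offset);
}

// Corporate assignments descend from U+F8FF.
constexpr char16_t corporateCode(unsigned offset) {
    return static_cast<char16_t>(kPrivateUseLast - offset);
}
```

With this scheme, the third end-user character is endUserCode(2), or U+E002, and the third corporate character is corporateCode(2), or U+F8FD.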