Abstract

Every character-based document is based on some model of which characters are available, and how they are encoded. Unicode is the most popular character set today and provides a variety of encoding schemes, each of them being a Unicode Transformation Format (UTF). In addition to character sets and encodings, other issues relevant when dealing with characters are transcoding and normalization, which deal with the problems arising when using different character encodings or different encodings of particular characters.

Outline (Characters)

Characters
Character Sets
Unicode Basics
Normalization and Transcoding
Conclusions

Characters and Computers

American Standard Code for Information Interchange (ASCII)
- for the first time a basic set of characters had a universally accepted encoding
- many Internet protocols (such as HTTP) encode their information in ASCII commands
ASCII is a very limited repertoire of characters
- basic ASCII contains 128 characters (7 bit), a number of which are control characters
- no variants of characters (German umlauts, French accents) are supported
- various code pages extending ASCII to 8 bit exist and are hard to distinguish
Character is not a trivial concept when regarded globally
- European languages all have writing systems based on a small number of atoms
- other languages and writing systems have vastly different ideas of language atoms
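
The 7-bit limit can be tried out directly. The following is a minimal Python sketch; the helper name `is_ascii_bytes` and the sample strings are made up for illustration.

```python
def is_ascii_bytes(data: bytes) -> bool:
    """Return True if every byte fits into 7 bits (0..127)."""
    return all(b < 0x80 for b in data)

# An HTTP request line consists of ASCII characters only.
request_line = "GET /index.html HTTP/1.1".encode("ascii")
print(is_ascii_bytes(request_line))

# A German umlaut is outside the ASCII repertoire, so encoding fails.
try:
    "Zürich".encode("ascii")
except UnicodeEncodeError as err:
    print("not representable in ASCII:", err.object[err.start:err.end])
```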

Characters

Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape […]

The Unicode Standard, Version 4.0, Addison-Wesley, 2003

The alphabetic approach is only one of several possibilities
- A character in Japanese hiragana and katakana scripts corresponds to a syllable (usually a combination of consonant plus vowel)
- Korean Hangul combines symbols for individual sounds of the language into square blocks, each of which represents a syllable; depending on the user and the application, either the individual symbols or the syllabic clusters can be considered to be characters
- In Indic scripts each consonant letter carries an inherent vowel that is eliminated or replaced using semi-regular or irregular ways to combine consonants and vowels into clusters; depending on the user and the application, either individual consonants or vowels, or the consonant or consonant-vowel clusters can be perceived as characters
- Arabic and Hebrew vowel sounds are typically not written at all; when they are written they are indicated by the use of combining marks placed above and below the consonantal letters

Glyphs

[A Glyph is] a recognizable abstract graphic symbol which is independent of a specific design.

ISO/IEC 9541:1991, Information Technology – Font Information Interchange

Visual rendering introduces the notion of a glyph.
There is not a one-to-one correspondence between characters and glyphs
- A single character can be represented by multiple glyphs (each glyph is then part of the representation of that character); these glyphs may be physically separated from one another
- A single glyph may represent a sequence of characters (this is the case with ligatures, among others)
- A character may be rendered with very different glyphs depending on the context
- A single glyph may represent different characters (e.g. capital Latin A, capital Greek A and capital Cyrillic A)
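
The last case can be checked with Python's standard `unicodedata` module. This is only an illustrative sketch: the three letters usually look identical on screen, yet they are three distinct characters.

```python
import unicodedata

# Three distinct characters that are usually drawn with the same glyph.
lookalikes = ["\u0041", "\u0391", "\u0410"]

for ch in lookalikes:
    # Print the code point and the official character name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Although they look alike, no two of them compare equal.
print(len(set(lookalikes)))
```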

Outline (Character Sets)

History of Character Sets

Text documents need ways to represent characters
- computers handle bits, not characters
- to handle characters, computers need a mapping from characters to bits
For a long time, computers were doing their work in a very isolated way
- "I think there is a world market for maybe five computers." (attributed to T. J. Watson, probably apocryphally)
With more computers being used, more data is exchanged between computers
Data rot happens on all levels (media, formats, applications)
Standardization of character sets started in the 1960s
- ASCII was the first generally accepted character set
- EBCDIC was invented and marketed by IBM (and is a terribly designed character encoding)
- ISO 8859 was the first attempt to better support character sets beyond ASCII
- Asian scripts were always a problem because of the number of characters they need

EBCDIC

EBCDIC (1964)

Augmented EBCDIC

Beyond ASCII

ASCII is called ASCII for a reason
- it works well for English-speaking countries
- the majority of other languages cannot be represented
Character sets and the 8 bit computer start to collide
- ASCII is very convenient because characters and bytes correspond 1:1
- every character set expanding ASCII will make this more complicated
- complications can occur within and/or outside of the character set
Introducing a character set beyond 8 bit is a fundamental change
- dealing with and counting bytes is a seductively simple idea
Introducing several 8 bit character sets saves the 8 bit world
- by introducing several character sets, each of them can remain 8 bit
- the complexity has now been shifted to the handling of various character sets

ISO 8859

A family of character sets rather than a single character set
- each ISO 8859 family member is an 8 bit character set (256 characters)
- the lower half (128 characters) is always the same (ASCII)
- the upper half supports different user groups and differs between family members
ISO 8859 files cannot be identified by inspection
- ASCII characters can always be safely interpreted (identical on all ISO 8859 code pages)
- the upper half can only be interpreted if the code page is known

ISO 8859 environments must carefully track the code pages being used
- failure to do so results in misinterpretation of characters

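The sentence below is shown twice: first with its bytes interpreted as ISO 8859-15 (Latin-9), then with the same bytes interpreted as ISO 8859-1 (Latin-1), where byte 0xA4 is the generic currency sign ¤ rather than the Euro sign.

It also shows the Euro sign € which is part of ISO 8859-15 (Latin-9), but not included in ISO 8859-1 (Latin-1).

It also shows the Euro sign ¤ which is part of ISO 8859-15 (Latin-9), but not included in ISO 8859-1 (Latin-1).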
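
The misinterpretation can be reproduced in a few lines of Python. This is a sketch using the standard codecs; the variable names are chosen for illustration only.

```python
# One byte from the upper half of an ISO 8859 code page.
raw = b"\xa4"

as_latin9 = raw.decode("iso-8859-15")  # interpreted as Latin-9
as_latin1 = raw.decode("iso-8859-1")   # interpreted as Latin-1

# The same byte yields two different characters.
print(f"Latin-9: U+{ord(as_latin9):04X}  Latin-1: U+{ord(as_latin1):04X}")

# The lower half (here: the printable ASCII range) decodes identically.
ascii_part = bytes(range(0x20, 0x7F))
print(ascii_part.decode("iso-8859-15") == ascii_part.decode("iso-8859-1"))
```

Nothing in the byte itself says which interpretation is intended, which is why the code page has to be tracked separately.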

ISO 8859-1 (Latin-1) & ISO 8859-2 (Latin-2)

Latin-1 (Western European)

Latin-2 (Central European)

ISO 8859-4 (Latin-4) & ISO 8859-5 (Cyrillic)

Latin-4 (North European)

Cyrillic

ISO 8859-7 (Greek) & ISO 8859-15 (Latin-9)

Greek

Latin-9

Outline (Unicode Basics)

ISO 8859 Problems

One document can only contain characters from one character set
- mixing characters from different sets is impossible without additional mechanisms
```
In Latin-9, Latin-1's currency symbol ¤ has been replaced with the Euro sign €.
```
An increasing number of character sets does not make life easier
- in particular, if they sometimes differ only slightly (e.g., Latin-1 vs. Latin-9)
For bigger character sets, the 8 bit approach is not working at all
- the ISO 8859 approach allows only 128 special characters (the lower half is ASCII)
ISO 8859 is as good as it gets with 8 bit
- to improve this approach, the 8 bit philosophy must be abandoned

Unicode

Generalization takes characters beyond one and even two bytes
- Unicode has been designed to cover all characters of the world
- Unicode recently added its 100'000th character
- for handling this character set, a more structured approach is required
Unicode cleanly separates various conceptual steps
- characters are collected and are then part of the character repertoire
- characters are then identified by a unique code point (written as U+0041)
- a Character Encoding Form (CEF) then maps the code points of the Coded Character Set (CCS) to code units, and a Character Encoding Scheme (CES) serializes these code units as bytes
XML is ASCII for the 21st century
- purists sometimes consider Unicode too big or dangerous
- Unicode is well-established and is necessary in a globalized economy
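
One way to picture the conceptual steps is to follow a single character through them. The sketch below uses UTF-16 as the encoding form; the choice of character and the variable names are assumptions made for illustration.

```python
import struct

ch = "\u597d"  # the character 好

# Coded character set: the character is identified by its code point.
code_point = ord(ch)
print(f"U+{code_point:04X}")

# Encoding form (UTF-16): the code point becomes 16-bit code units
# (exactly one unit here, because the character is in the BMP).
code_units = struct.unpack(">1H", ch.encode("utf-16-be"))

# Encoding scheme: the code units are serialized as bytes in a byte order.
big_endian = ch.encode("utf-16-be")
little_endian = ch.encode("utf-16-le")
print(big_endian.hex(" "), "|", little_endian.hex(" "))
```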

Unicode Character Count

Unicode has the ability to encode 17 × 2¹⁶ = 1'114'112 characters
- this means that currently ~10% of the available space is used
Characters are organized into 17 planes of 2¹⁶ = 65'536 characters
Planes are numbered from 0 to 16
Plane 0 is the Basic Multilingual Plane (BMP)
- all characters which are in practical use today are part of the BMP (well, almost …)
Planes beyond the BMP contain rare and historic characters
- Old Italic, Deseret, Byzantine Musical Symbols
- most space within these astral planes is empty
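
The arithmetic behind these numbers, as a small Python sketch (the function name `plane_of` is hypothetical):

```python
PLANE_SIZE = 2 ** 16   # code points per plane
PLANE_COUNT = 17       # planes 0 to 16

code_space = PLANE_COUNT * PLANE_SIZE

def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) of a code point."""
    if not 0 <= code_point < code_space:
        raise ValueError("outside the Unicode code space")
    # Each plane spans 2**16 code points, so the plane is the part above bit 15.
    return code_point >> 16

print(code_space)
print(plane_of(0x597D))    # a CJK character in the BMP
print(plane_of(0x10400))   # first code point of the Deseret block
```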

BMP Structure

Roadmap of Unicode BMP. Each numbered box represents 256 characters. (Source: Wikipedia)

Black = Latin scripts and symbols
Light Blue = Linguistic scripts
Blue = Other European scripts
Orange = Middle Eastern and SW Asian scripts
Light Orange = African scripts
Green = South Asian scripts
Purple = Southeast Asian scripts
Red = East Asian scripts
Light Red = Unified CJK Han
Yellow = Canadian Aboriginal scripts
Magenta = Symbols
Dark Grey = Diacritics
Light Grey = UTF-16 surrogates and private use
Cyan = Miscellaneous characters
White = Unused

Unicode Encodings

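| Character | A | א | 好 | 𣎴 |
| --- | --- | --- | --- | --- |
| Code point | U+0041 | U+05D0 | U+597D | U+233B4 |
| UTF-8 | 41 | D7 90 | E5 A5 BD | F0 A3 8E B4 |
| UTF-16 | 00 41 | 05 D0 | 59 7D | D8 4C DF B4 |
| UTF-32 | 00 00 00 41 | 00 00 05 D0 | 00 00 59 7D | 00 02 33 B4 |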
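
The byte sequences in the table can be reproduced with Python's built-in codecs. The sketch assumes big-endian byte order for UTF-16 and UTF-32, which is what the table shows; `hex_bytes` is a helper name made up for this example.

```python
# The four characters from the table, written as escapes.
characters = ["\u0041", "\u05d0", "\u597d", "\U000233b4"]

def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and format the bytes as space-separated upper-case hex."""
    return text.encode(encoding).hex(" ").upper()

for ch in characters:
    print(
        f"U+{ord(ch):04X}",
        hex_bytes(ch, "utf-8"),
        hex_bytes(ch, "utf-16-be"),
        hex_bytes(ch, "utf-32-be"),
        sep=" | ",
    )
```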

UTF-8

UTF-8 is one of the two standardized encodings for XML
- every ASCII document by definition is a UTF-8 document
- UTF-8 must be supported by every XML implementation
UTF-8 is not trivial, but it is widely supported and easy to implement
- there is no 1:1 correspondence between bytes and characters
- each Unicode character is encoded by 1-4 bytes (the original design allowed up to 6 bytes; RFC 3629 restricts UTF-8 to the Unicode code space)
- UTF-8 is good for Europeans (1 byte per ASCII character)
Non-Unicode documents must be transcoded to UTF-8
- keeping track of resource character encodings is a good idea
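
To show why UTF-8 is easy to implement, here is a hand-written encoder for a single code point. It is a sketch, not a replacement for the built-in codec, and it follows the four-byte limit that RFC 3629 sets for the Unicode code space; the function name `encode_utf8` is an assumption.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (1 to 4 bytes)."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:                 # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:              # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Compare with the built-in codec for characters of 1, 2, 3 and 4 bytes.
for cp in (0x41, 0x5D0, 0x597D, 0x233B4):
    print(f"U+{cp:04X}", encode_utf8(cp) == chr(cp).encode("utf-8"))
```

The leading byte announces the length of the sequence and every following byte starts with the bits 10, so a decoder can always find character boundaries.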

Other UTFs

UTF-16 stores every character as 2 or 4 bytes
- BMP characters are stored as 2 bytes
- astral plane characters are stored as 4 bytes
- UTF-16 is the other encoding (in addition to UTF-8) required by XML
UTF-32 stores every character as 4 bytes
- very simple and very inefficient (ASCII text volume increases by 300%)
Multi-byte formats introduce the problem of byte order
- UTF-16BE/UTF-32BE and UTF-16LE/UTF-32LE have a fixed byte order (big-endian or little-endian)
- plain UTF-16/UTF-32 may start with a Byte Order Mark (BOM) (U+FEFF) from which the byte order can be detected
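
The 4-byte case of UTF-16 and the role of the BOM can be sketched as follows. The function name `to_surrogates` is hypothetical; the BOM constants come from Python's standard `codecs` module.

```python
import codecs

def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point beyond the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP need surrogates")
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low

high, low = to_surrogates(0x233B4)
print(f"{high:04X} {low:04X}")

# A BOM in front of the data tells the generic "utf-16" decoder
# which byte order was used when the data was written.
text = "\U000233b4"
be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
print(be.decode("utf-16") == le.decode("utf-16") == text)
```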

Character Set Identification

HTTP, XML, and HTML support character set identification
- HTTP supports the Content-Type header field
```
Content-Type: text/html; charset=utf-8
```
- XML encodes the character set in the XML declaration
```
<?xml version="1.0" encoding="utf-8"?>
```
- HTML supports the meta element in the document's head
```
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
```
Different identifications serve different purposes
Conflicting identifications are a sign of management problems
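
Conflicts between identifications can be detected mechanically. The following sketch compares the HTTP header with the XML declaration; the function names are made up for this example, and the XML check is a simplification that assumes the declaration is written in an ASCII-compatible encoding.

```python
import re
from email.message import Message
from typing import Optional

def charset_from_http(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

# Simplified pattern for the encoding pseudo-attribute of an XML declaration.
XML_DECL = re.compile(rb"""<\?xml[^>]*\bencoding=["']([A-Za-z0-9._-]+)["']""")

def charset_from_xml(document: bytes) -> Optional[str]:
    """Extract the declared encoding from the start of an XML document."""
    match = XML_DECL.match(document)
    return match.group(1).decode("ascii").lower() if match else None

http_charset = charset_from_http("text/html; charset=utf-8")
xml_charset = charset_from_xml(b'<?xml version="1.0" encoding="utf-8"?><doc/>')

if http_charset != xml_charset:
    print("conflicting character set identifications")
else:
    print("identifications agree:", http_charset)
```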

Outline (Normalization and Transcoding)

Unicode is Complex

Many languages have characters which are composed
- German umlauts are vowels with a double dot
- French accents and the cedilla are also added to regular characters
Unicode contains composed as well as composing characters
- for most languages, composed characters are considered to be regular characters
- in some circumstances, it might be required to compose a character out of a base and a diacritical mark
- as a result, the question arises how to define the equality of these variants
Unicode defines normal forms which prescribe one variant
- a complex field with different concepts of equivalence (canonical and compatibility)
- based on these equivalences, there are four normalization forms (NFC, NFD, NFKC, NFKD)
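
The equality question can be made concrete with Python's standard `unicodedata` module. This is a sketch; the sample word and variable names are chosen for illustration.

```python
import unicodedata

composed = "fran\u00e7ais"      # precomposed c with cedilla
decomposed = "franc\u0327ais"   # c followed by a combining cedilla

# The two strings look the same but are different code point sequences.
print(composed == decomposed)

# Canonical normalization maps both variants to one form.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)

# Compatibility normalization additionally folds characters such as
# the fi ligature (U+FB01) into their plain equivalents.
print(unicodedata.normalize("NFKC", "\ufb01") == "fi")
```

Comparing strings only after bringing both into the same normalization form avoids the inequality shown in the first comparison.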

<français>français ≠ français</français>

The same line with its UTF-8 bytes displayed as Windows-1252 shows that the two spellings differ (c followed by the combining cedilla U+0327 versus the precomposed ç, U+00E7):

<francÌ§ais>francÌ§ais â‰  franÃ§ais</franÃ§ais>

Normalization Examples

Transcoding

Handling Unicode in an unconstrained environment is not easy
- Unicode's size and variability make character processing harder than it used to be
- for some applications with limited character needs, this might be too much
In all text-based environments, well-defined rules must be established for
1. the encoding of documents (possibly including variants, such as Unicode normalization forms)
2. accepted incoming encodings and how they are mapped to the internal encoding
3. available outgoing encodings and how they can be requested and will be generated
Transcoding is the activity of changing the encoding of data
- ideally, transcoding should be lossless and round-trip proof
- in any scenario with non-trivial encodings, this is not an easy goal
- even staying with one encoding can be a problem (if there are variations allowed)
Transcoding is essential for maintaining data quality
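
A minimal transcoding sketch in Python, assuming strict error handling so that any loss is reported instead of silently accepted; the function name `transcode` and the sample text are made up for this example.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode data from the source encoding into the target encoding.

    Raises UnicodeError if the data is not valid in the source encoding
    or contains characters the target encoding cannot represent.
    """
    return data.decode(source).encode(target)

latin9 = "10 \u20ac".encode("iso-8859-15")   # "10 €" in Latin-9

# Latin-9 -> UTF-8 -> Latin-9 is lossless (round-trip proof).
utf8 = transcode(latin9, "iso-8859-15", "utf-8")
print(transcode(utf8, "utf-8", "iso-8859-15") == latin9)

# UTF-8 -> Latin-1 cannot represent the Euro sign; strict mode refuses.
try:
    transcode(utf8, "utf-8", "iso-8859-1")
except UnicodeEncodeError:
    print("Euro sign is not available in Latin-1")
```

Whether such a failure should abort, substitute a replacement character, or escape the character is exactly the kind of rule that has to be defined for the environment.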

Outline (Conclusions)

Text is Complex

Even unstructured text is complex when generalizing the concept
Most applications have parts where they manage text-based data
For simple applications, simple character sets may be appropriate
Unicode provides a large character repertoire, but at a certain cost
Handling text encodings must be well-defined
- this is not trivial when thinking about a general text-based application
- once rules have been defined and documented, things usually go well

Character Set Issues & Unicode

Web-Based Publishing (INFO 290-19)

Erik Wilde, UC Berkeley School of Information
2007-02-01
