Character Set Issues & Unicode

Web Architecture (INFO 290-03)

Erik Wilde, UC Berkeley School of Information
2007-09-25
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Abstract

Every character-based document relies on some model of which characters are available and how they are encoded. Unicode is the most widely used character set today and provides a variety of encoding schemes, each of them a Unicode Transformation Format (UTF). Beyond character sets and encodings, two further issues matter when dealing with characters: transcoding, which addresses the problems of converting between different character encodings, and normalization, which addresses the problems caused by different encodings of the same character.

Outline (Characters)

  1. Characters [3]
  2. Character Sets [10]
  3. Unicode Basics [8]
  4. Normalization and Transcoding [3]
  5. Conclusions [1]

Characters and Computers

Characters

Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape […]

The Unicode Standard, Version 4.0, Addison-Wesley, 2003

Glyphs

[A Glyph is] a recognizable abstract graphic symbol which is independent of a specific design.

ISO/IEC 9541:1991, Information Technology – Font Information Interchange

Outline (Character Sets)

  1. Characters [3]
  2. Character Sets [10]
  3. Unicode Basics [8]
  4. Normalization and Transcoding [3]
  5. Conclusions [1]

History of Character Sets

ASCII 1963

ASCII 1965

ASCII 1967

EBCDIC

EBCDIC (1964)

Augmented EBCDIC

Beyond ASCII

ISO 8859

ISO 8859-1 (Latin-1) & ISO 8859-2 (Latin-2)

ISO 8859-1 (Latin-1)

Latin-1 (Western European)

ISO 8859-2 (Latin-2)

Latin-2 (Central European)

ISO 8859-4 (Latin-4) & ISO 8859-5 (Cyrillic)

ISO 8859-4 (Latin-4)

Latin-4 (North European)

ISO 8859-5 (Cyrillic)

Cyrillic

ISO 8859-7 (Greek) & ISO 8859-15 (Latin-9)

ISO 8859-7 (Greek)

Greek

ISO 8859-15 (Latin-9)

Latin-9

Outline (Unicode Basics)

  1. Characters [3]
  2. Character Sets [10]
  3. Unicode Basics [8]
  4. Normalization and Transcoding [3]
  5. Conclusions [1]

ISO 8859 Problems
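
One problem with the ISO 8859 family can be sketched in a few lines: each part reuses the same 8-bit values for different characters, so a byte is ambiguous unless the part is known. The helper name `decode_byte` is made up for this illustration, and the sketch relies on Python's built-in ISO 8859 codecs.

```python
import unicodedata

def decode_byte(value: int, charsets: list[str]) -> dict[str, str]:
    """Decode one byte under several 8-bit character sets.

    Returns a mapping from character set name to the Unicode name of the
    character that the byte represents in that set.
    """
    raw = bytes([value])
    return {name: unicodedata.name(raw.decode(name)) for name in charsets}

# 0xA4 differs between Latin-1 and Latin-9.
print(decode_byte(0xA4, ["iso-8859-1", "iso-8859-15"]))

# 0xE7 is a Latin, Cyrillic, or Greek letter depending on the part.
print(decode_byte(0xE7, ["iso-8859-1", "iso-8859-5", "iso-8859-7"]))
```

Without out-of-band information about the part in use, a receiver cannot tell which of these characters was meant.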

Unicode

Unicode Character Count

BMP Structure

Roadmap of Unicode BMP. Each numbered box represents 256 characters. (Source: Wikipedia)

  •  Black  = Latin scripts and symbols
  •  Light Blue  = Linguistic scripts
  •  Blue  = Other European scripts
  •  Orange  = Middle Eastern and SW Asian scripts
  •  Light Orange  = African scripts
  •  Green  = South Asian scripts
  •  Purple  = Southeast Asian scripts
  •  Red  = East Asian scripts
  •  Light Red  = Unified CJK Han
  •  Yellow  = Canadian Aboriginal scripts
  •  Magenta  = Symbols
  •  Dark Grey  = Diacritics
  •  Light Grey  = UTF-16 surrogates and private use
  •  Cyan  = Miscellaneous characters
  •  White  = Unused
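
Since each box in the roadmap covers 256 characters, the box holding a BMP code point follows from its high byte. The sketch below assumes the boxes are labeled with that high byte in hexadecimal, which the caption does not state. The function names `bmp_box` and `is_surrogate` are made up for this illustration.

```python
def bmp_box(cp: int) -> str:
    """Return the two-digit hex label of the 256-character box holding cp.

    Assumption: boxes are labeled 00..FF by the high byte of the code point.
    """
    if not 0 <= cp <= 0xFFFF:
        raise ValueError(f"U+{cp:04X} lies outside the BMP")
    return f"{cp >> 8:02X}"

def is_surrogate(cp: int) -> bool:
    """True for the code points reserved for UTF-16 surrogates."""
    return 0xD800 <= cp <= 0xDFFF

for cp in (0x0041, 0x05D0, 0x597D, 0xD84C):
    print(f"U+{cp:04X}", "box", bmp_box(cp), "surrogate:", is_surrogate(cp))
```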

Unicode Encodings

            A            א            好           𣎴
Code point  U+0041       U+05D0       U+597D       U+233B4
UTF-8       41           D7 90        E5 A5 BD     F0 A3 8E B4
UTF-16      00 41        05 D0        59 7D        D8 4C DF B4
UTF-32      00 00 00 41  00 00 05 D0  00 00 59 7D  00 02 33 B4
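
The byte values in this table can be checked with Python's built-in codecs. This is a minimal sketch that assumes the big-endian variants without a byte order mark, which is how the table lists the bytes. The names `SAMPLES`, `hex_bytes`, and `encodings` are made up for this illustration.

```python
# The four sample characters from the table, written as escapes so the
# source stays ASCII-only.
SAMPLES = ["\u0041", "\u05D0", "\u597D", "\U000233B4"]

def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

def encodings(ch: str) -> dict[str, str]:
    """Return the code point and the big-endian UTF forms of one character."""
    return {
        "code point": f"U+{ord(ch):04X}",
        "UTF-8": hex_bytes(ch.encode("utf-8")),
        "UTF-16": hex_bytes(ch.encode("utf-16-be")),
        "UTF-32": hex_bytes(ch.encode("utf-32-be")),
    }

for ch in SAMPLES:
    print(encodings(ch))
```

Only the last character lies outside the BMP, which is why it alone needs four bytes (a surrogate pair) in UTF-16.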

UTF-8
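
As a rough illustration of how UTF-8 maps a code point to one to four bytes, the sketch below spells out the bit layout by hand. It is a teaching aid, not a replacement for a library codec, and the function name `utf8_encode` is made up for this illustration.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8, one range at a time."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx (identical to ASCII)
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

# Compare the hand-written encoder with Python's codec.
for cp in (0x41, 0x5D0, 0x597D, 0x233B4):
    print(f"U+{cp:04X}", utf8_encode(cp) == chr(cp).encode("utf-8"))
```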

Other UTFs

Character Set Identification

Outline (Normalization and Transcoding)

  1. Characters [3]
  2. Character Sets [10]
  3. Unicode Basics [8]
  4. Normalization and Transcoding [3]
  5. Conclusions [1]

Unicode is Complex

<français>français ≠ français</français>
<français>français ≠ français</français>
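
The example above presumably contrasts two encodings of "ç": a single precomposed code point versus "c" followed by a combining cedilla. The extracted text does not show which is which, so this is an assumption. A minimal sketch of the mismatch, and of how normalization removes it, using Python's `unicodedata` module (the helper name `same_text` is made up for this illustration):

```python
import unicodedata

composed = "fran\u00E7ais"     # U+00E7 LATIN SMALL LETTER C WITH CEDILLA
decomposed = "franc\u0327ais"  # "c" followed by U+0327 COMBINING CEDILLA

# The two strings render identically but are different code point sequences.
print(composed == decomposed)
print(len(composed), len(decomposed))

def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both to one normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print(same_text(composed, decomposed))
```

In XML this matters for element names as well as content: a start tag and an end tag that look the same but differ in normalization do not match.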

Normalization Examples

Compatibility Composites
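
Compatibility composites are characters such as ligatures or superscripts that Unicode keeps for round-trip compatibility with older character sets. As a small sketch, and assuming the "fi" ligature and the superscript two are representative examples (they are not taken from the slide), the canonical forms leave such characters alone while the compatibility forms replace them:

```python
import unicodedata

ligature = "\uFB01"     # LATIN SMALL LIGATURE FI
superscript = "\u00B2"  # SUPERSCRIPT TWO

for ch in (ligature, superscript):
    nfc = unicodedata.normalize("NFC", ch)
    nfkc = unicodedata.normalize("NFKC", ch)
    # NFC keeps the character; NFKC maps it to its plain equivalent.
    print(unicodedata.name(ch), "| NFC unchanged:", nfc == ch, "| NFKC:", nfkc)
```

The compatibility forms lose formatting information, so they suit searching and comparison better than storing text.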

Transcoding
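
Transcoding can be pictured as decoding bytes into abstract characters and re-encoding them in another character encoding. The sketch below assumes the source encoding is known in advance. The helper name `transcode` is made up for this illustration.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding and re-encode in the target."""
    return data.decode(source).encode(target)

latin1 = b"fran\xE7ais"  # "français" in ISO 8859-1: ç is the single byte E7
utf8 = transcode(latin1, "iso-8859-1", "utf-8")
print(utf8)              # ç becomes the two bytes C3 A7

# Going the other way fails for characters the target set does not contain.
try:
    transcode("\u597D".encode("utf-8"), "utf-8", "iso-8859-1")
except UnicodeEncodeError as err:
    print("cannot transcode:", err.reason)
```

Transcoding into Unicode is lossless for the ISO 8859 parts. The reverse direction is lossy whenever the text uses characters outside the target character set.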

Outline (Conclusions)

  1. Characters [3]
  2. Character Sets [10]
  3. Unicode Basics [8]
  4. Normalization and Transcoding [3]
  5. Conclusions [1]

Text is Complex