Character Set Issues & Unicode

Web Architecture
Fall 2008 — INFO 290-03 (CCN 42584)

Erik Wilde, UC Berkeley School of Information
2008-09-25

This work is licensed under a Creative Commons Attribution 3.0 Unported License [http://creativecommons.org/licenses/by/3.0/]

Contents E. Wilde: Character Set Issues & Unicode

(2) Abstract

Every character-based document relies on a model of which characters are available and how they are encoded. Unicode is the most widely used character set today and defines several encoding schemes, each of them a Unicode Transformation Format (UTF). Beyond character sets and encodings, working with characters also involves transcoding, which converts text from one character encoding to another, and normalization, which reconciles different encodings of the same character.



Characters

Outline (Characters)

  1. Characters [3]
  2. Character Sets [10]
  3. Unicode Basics [8]
  4. Normalization and Transcoding [3]
  5. Conclusions [1]

(4) Characters and Computers



(5) Characters

Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape […]

The Unicode Standard, Version 4.0, Addison-Wesley, 2003 [http://dret.net/biblio/reference/unicode4]



(6) Glyphs

[A Glyph is] a recognizable abstract graphic symbol which is independent of a specific design.

ISO/IEC 9541:1991, Information Technology – Font Information Interchange [http://dret.net/biblio/reference/iso9541]



Character Sets

(8) History of Character Sets



(9) ASCII 1963

ASCII 1963

(10) ASCII 1965

ASCII 1965

(11) ASCII 1967

ASCII 1967

(12) EBCDIC

EBCDIC (1964)

Augmented EBCDIC



(13) Beyond ASCII



(14) ISO 8859



(15) ISO 8859-1 (Latin-1) & ISO 8859-2 (Latin-2)

ISO 8859-1 (Latin-1)

Latin-1 (Western European)

ISO 8859-2 (Latin-2)

Latin-2 (Central European)



(16) ISO 8859-4 (Latin-4) & ISO 8859-5 (Cyrillic)

ISO 8859-4 (Latin-4)

Latin-4 (North European)

ISO 8859-5 (Cyrillic)

Cyrillic



(17) ISO 8859-7 (Greek) & ISO 8859-15 (Latin-9)

ISO 8859-7 (Greek)

Greek

ISO 8859-15 (Latin-9)

Latin-9



Unicode Basics

(19) ISO 8859 Problems



(20) Unicode



(21) Unicode Character Count



(22) BMP Structure

Roadmap of Unicode BMP. Each numbered box represents 256 characters. (Source: Wikipedia [http://en.wikipedia.org/wiki/Basic_Multilingual_Plane])

  • Black = Latin scripts and symbols
  • Light Blue = Linguistic scripts
  • Blue = Other European scripts
  • Orange = Middle Eastern and SW Asian scripts
  • Light Orange = African scripts
  • Green = South Asian scripts
  • Purple = Southeast Asian scripts
  • Red = East Asian scripts
  • Light Red = Unified CJK Han
  • Yellow = Canadian Aboriginal scripts
  • Magenta = Symbols
  • Dark Grey = Diacritics
  • Light Grey = UTF-16 surrogates and private use
  • Cyan = Miscellaneous characters
  • White = Unused


(23) Unicode Encodings

Character   | A           | א           | 好          | (image: U+233B4.gif)
Code point  | U+0041      | U+05D0      | U+597D      | U+233B4
UTF-8       | 41          | D7 90       | E5 A5 BD    | F0 A3 8E B4
UTF-16      | 00 41       | 05 D0       | 59 7D       | D8 4C DF B4
UTF-32      | 00 00 00 41 | 00 00 05 D0 | 00 00 59 7D | 00 02 33 B4
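The byte sequences in this table can be reproduced with Python's built-in codecs. The following is a minimal sketch, assuming the table lists big-endian byte order without a byte order mark (hence the "utf-16-be" and "utf-32-be" codec names); the helper names hex_bytes and encodings are made up for this illustration.

```python
# Sketch: encode the four sample code points from the table in each UTF.
# Assumption: big-endian byte order, no byte order mark.
CODE_POINTS = [0x0041, 0x05D0, 0x597D, 0x233B4]


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def encodings(cp: int) -> dict:
    """Return the UTF-8, UTF-16 and UTF-32 byte sequences for one code point."""
    ch = chr(cp)
    return {
        "UTF-8": hex_bytes(ch.encode("utf-8")),
        "UTF-16": hex_bytes(ch.encode("utf-16-be")),
        "UTF-32": hex_bytes(ch.encode("utf-32-be")),
    }


for cp in CODE_POINTS:
    print(f"U+{cp:04X}", encodings(cp))
```

U+233B4 lies outside the BMP, so its UTF-16 form is a surrogate pair (two 16-bit units), while the three BMP characters each fit into one 16-bit unit.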


(24) UTF-8
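As an illustration of how UTF-8 maps a code point to one to four bytes, here is a hand-written encoder sketch. It is not taken from the slides; the function name utf8_encode is hypothetical, and in practice Python's own str.encode("utf-8") should be used instead.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        # Surrogates and values beyond U+10FFFF are not encodable.
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # 1 byte:  0xxxxxxx (identical to ASCII)
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Cross-check the sketch against the built-in codec.
for cp in (0x41, 0x5D0, 0x597D, 0x233B4):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

The leading byte announces the sequence length, and every continuation byte starts with the bits 10, which is what lets a decoder resynchronize in the middle of a byte stream.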



(25) Other UTFs



(26) Character Set Identification



Normalization and Transcoding

(28) Unicode is Complex

<français>français ≠ français</français>
<français>français ≠ français</français>
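One way two visually identical strings can differ is in how the "ç" is encoded: as the single precomposed code point U+00E7, or as "c" followed by U+0327 COMBINING CEDILLA. The sketch below assumes that this is the distinction the example is about; the helper name same_text is hypothetical.

```python
import unicodedata

composed = "fran\u00e7ais"     # ç as one precomposed code point (U+00E7)
decomposed = "franc\u0327ais"  # c followed by U+0327 COMBINING CEDILLA


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both into one normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(composed == decomposed)           # False: different code point sequences
print(len(composed), len(decomposed))   # 8 9
print(same_text(composed, decomposed))  # True once both are normalized
```

This matters for markup as well as for content: an element name written in one form does not match a closing tag or a query written in the other unless both sides are normalized first.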


(29) Normalization Examples

Compatibility Composites
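Compatibility characters are only folded into their plain equivalents by the compatibility normalization forms (NFKC and NFKD), not by the canonical forms (NFC and NFD). A small sketch with two sample characters chosen for this illustration, the "fi" ligature and the superscript two:

```python
import unicodedata

ligature = "\ufb01"     # U+FB01 LATIN SMALL LIGATURE FI
superscript = "\u00b2"  # U+00B2 SUPERSCRIPT TWO

for ch in (ligature, superscript):
    nfc = unicodedata.normalize("NFC", ch)    # canonical: character is kept
    nfkc = unicodedata.normalize("NFKC", ch)  # compatibility: character is folded
    print(f"U+{ord(ch):04X}", repr(nfc), repr(nfkc))
```

Compatibility normalization discards formatting distinctions (a superscript two becomes a plain digit two), so it is suited to searching and comparison rather than to storing text unchanged.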

(30) Transcoding
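Transcoding can be pictured as decoding bytes into Unicode characters and then encoding those characters in the target encoding. The sketch below is an illustration, not material from the slides; the function name transcode is hypothetical, and the codec names are the ones Python's standard library provides.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one character encoding into another."""
    return data.decode(source).encode(target)


# ISO 8859-1 stores "ç" as the single byte E7; UTF-8 needs two bytes, C3 A7.
latin1 = "fran\u00e7ais".encode("iso-8859-1")
utf8 = transcode(latin1, "iso-8859-1", "utf-8")
print(latin1.hex(" "), "->", utf8.hex(" "))

# Transcoding fails when the target character set lacks a character:
# the euro sign exists in ISO 8859-15 (byte A4) but not in ISO 8859-1.
euro_utf8 = "\u20ac".encode("utf-8")
print(transcode(euro_utf8, "utf-8", "iso-8859-15").hex(" "))
try:
    transcode(euro_utf8, "utf-8", "iso-8859-1")
except UnicodeEncodeError as err:
    print("cannot transcode:", err.reason)
```

Going from a legacy encoding to a UTF always succeeds for valid input, whereas the reverse direction needs a policy for characters that have no counterpart in the target set.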



Conclusions

(32) Text is Complex


