This work is licensed under a Creative Commons license.
Every character-based document relies on a model of which characters are available and how they are encoded. Unicode is the most widely used character set today and defines several encoding schemes, each called a Unicode Transformation Format (UTF). Beyond character sets and encodings, two further issues arise when working with characters: transcoding, which deals with converting between different character encodings, and normalization, which deals with different encodings of the same character.
atoms
language atoms
Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape […]
The Unicode Standard, Version 4.0, Addison-Wesley, 2003
[A Glyph is] a recognizable abstract graphic symbol which is independent of a specific design.
ISO/IEC 9541:1991, Information Technology – Font Information Interchange
I think there is a world market for maybe five computers. (¬ T. J. Watson)
It also shows the Euro sign € which is part of ISO 8859-15 (Latin-9), but not included in ISO 8859-1 (Latin-1).
It also shows the Euro sign ¤ which is part of ISO 8859-15 (Latin-9), but not included in ISO 8859-1 (Latin-1).
In Latin-9, Latin-1's currency symbol ¤ has been replaced with the Euro sign €.
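The difference between the two character sets can be checked with a short Python sketch. It decodes the single byte 0xA4 under both encodings and then tries to transcode the Euro sign into Latin-1; the variable names are illustrative only, and the codec names are those of Python's standard library.

```python
# The single byte 0xA4 means different things in the two character sets.
raw = bytes([0xA4])

as_latin1 = raw.decode("iso-8859-1")   # Latin-1: currency sign
as_latin9 = raw.decode("iso-8859-15")  # Latin-9: Euro sign

print(f"Latin-1: {as_latin1!r} (U+{ord(as_latin1):04X})")
print(f"Latin-9: {as_latin9!r} (U+{ord(as_latin9):04X})")

# Transcoding problem: the Euro sign has no byte value in Latin-1 at all.
try:
    "\u20ac".encode("iso-8859-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False

print("Euro sign encodable in Latin-1:", euro_fits_latin1)
```

A program that transcodes from Latin-9 to Latin-1 therefore has to decide what to do with the Euro sign: fail, substitute a replacement, or escape it in some way.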
U+0041
XML is ASCII for the 21st century
planes of 2^16 = 65'536 characters
0 to 16
Old Italic,
Deseret,
Byzantine Musical Symbols
astral planes is empty
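Since each plane holds 2^16 code points, the plane of a character is simply its code point divided by 65'536. The following Python sketch illustrates this; the helper name `plane_of` and the sample characters (one each from the Basic Latin, Old Italic, Deseret, and Byzantine Musical Symbols blocks) are chosen for illustration.

```python
import unicodedata


def plane_of(ch: str) -> int:
    """Return the Unicode plane (0 to 16) of a single character."""
    # Each plane holds 2**16 code points, so the plane is the code point
    # shifted right by 16 bits.
    return ord(ch) >> 16


# One sample from the BMP and three from plane 1.
samples = ["A", "\U00010300", "\U00010400", "\U0001D000"]

for ch in samples:
    print(f"U+{ord(ch):04X}  plane {plane_of(ch)}  {unicodedata.name(ch)}")
```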
Roadmap of Unicode BMP. Each numbered box represents 256 characters. (Source: Wikipedia)
 | A | א | 好 | 𣎴 |
---|---|---|---|---|
Code point | U+0041 | U+05D0 | U+597D | U+233B4 |
UTF-8 | 41 | D7 90 | E5 A5 BD | F0 A3 8E B4 |
UTF-16 | 00 41 | 05 D0 | 59 7D | D8 4C DF B4 |
UTF-32 | 00 00 00 41 | 00 00 05 D0 | 00 00 59 7D | 00 02 33 B4 |
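The byte sequences in the table can be reproduced with Python's built-in codecs. This is a small sketch rather than a reference implementation: the helper `hex_bytes` is a made-up name, and the big-endian codec variants are used on the assumption that the table lists UTF-16 and UTF-32 in big-endian byte order without a byte order mark.

```python
def hex_bytes(data: bytes) -> str:
    """Format a byte string as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


# The four code points from the table: U+0041, U+05D0, U+597D, U+233B4.
chars = ["\u0041", "\u05D0", "\u597D", "\U000233B4"]

for ch in chars:
    print(
        f"U+{ord(ch):04X}",
        "| UTF-8:", hex_bytes(ch.encode("utf-8")),
        "| UTF-16:", hex_bytes(ch.encode("utf-16-be")),
        "| UTF-32:", hex_bytes(ch.encode("utf-32-be")),
    )
```

The last character lies outside the BMP, which is why UTF-8 needs four bytes for it and UTF-16 needs a surrogate pair.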
Content-Type header field
Content-Type: text/html; charset=utf-8
<?xml version="1.0" encoding="utf-8"?>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
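A receiver has to extract the `charset` parameter before it can decode the document's bytes. The sketch below does this for a Content-Type value using the standard library's `email.message.Message`; the function name `charset_from_content_type` and the fallback to ISO 8859-1 when no charset is given are assumptions made for this example.

```python
from email.message import Message


def charset_from_content_type(header_value: str,
                              default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type value.

    The fallback used when the parameter is missing is an assumption
    of this sketch, not something the header itself specifies.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the parameter in lower case,
    # or None if the header carries no charset.
    return msg.get_content_charset() or default


charset = charset_from_content_type("text/html; charset=utf-8")
body = "fran\u00e7ais".encode("utf-8")  # bytes as they would arrive
print(charset, "->", body.decode(charset))
```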
regular characters
<français>français ≠ français</français>
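Two strings that look identical can still compare as unequal when one uses a precomposed character and the other a base letter plus a combining mark. The following Python sketch assumes that this is the difference between the two spellings of "français" and shows how normalization makes them comparable; the variable names are illustrative.

```python
import unicodedata

precomposed = "fran\u00e7ais"  # ç as one code point, U+00E7
decomposed = "franc\u0327ais"  # c followed by U+0327 COMBINING CEDILLA

# Code point by code point, the two strings differ.
print(precomposed == decomposed, len(precomposed), len(decomposed))

# After normalizing both to the same form (here NFC), they compare equal.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)
```

Comparing names only after normalizing both sides to the same form avoids mismatches such as an XML start tag and end tag that look the same but do not match.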