XML Basics

XML Foundations
Fall 2011 — INFO 242 (CCN 42596)

Ray Larson, UC Berkeley School of Information
2011-08-30

This work is licensed under a Creative Commons Attribution 3.0 Unported License [http://creativecommons.org/licenses/by/3.0/].

Contents R. Larson: XML Basics

(2) Abstract

The Extensible Markup Language (XML) defines a simple way of structuring data. The power and popularity of XML can be explained by its versatility, its platform independence, the standards and technologies that build on it, and the number of tools and products that support it. Understanding XML itself is rather simple because it depends on only a very small set of other technologies, of which Unicode and URIs are the most important. XML itself specifies two different things: a format for structured data, called XML documents, and a constraint language for XML documents, called Document Type Definition (DTD).



Foundations for XML

Outline (Foundations for XML)

  1. Foundations for XML [5]
    1. Unicode [2]
    2. Uniform Resource Identifier (URI) [2]
  2. XML [15]
    1. XML Documents [11]
    2. Processing XML [3]
  3. Conclusions [1]

(4) Identifications



Unicode

(6) XML's Idea of Content and Names

XML documents can use a wide array of characters. These characters are defined by Unicode [http://www.unicode.org/], which currently (Version 5.0) defines more than 100,000 characters (character number 100,000 was added in 2005).

<?xml version="1.0" encoding="UTF-8"?>
<JAPANESE>
 <TITLE>専門家リスト </TITLE>
 <ITEM>アシム・アブドゥラー氏(コマースネット事務局長)</ITEM>
 <ITEM>アラン・A・メッコラー氏(メッコラーメディア会長兼CEO)</ITEM>
 <ITEM>アラン・サルディッチ氏(メトリコムディレクター)</ITEM>
 <ITEM>ウィスター・ウォルコット氏(パイロットネットワーク・サービシズ副社長)</ITEM>
 <ITEM>・エリック・リンゲワルド氏(ビー・インク副社長)</ITEM>
 <ITEM>ジェームス・L・バークスデール氏(ネットスケープ・コミュニケーションズ社長)</ITEM>
</JAPANESE>

<?xml version="1.0" encoding="UTF-8"?>
<文書 改訂日付="1999年3月1日">
 <題>サンプル</題>
 <段落>これはサンプル文書です。</段落>
 <!-- コメント -->
 <段落>会社名</段落>
 <図面 図面実体名="サンプル" />
</文書>
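
The point that names, attribute values, and content may all use Unicode characters can be checked with a short sketch. It assumes Python and its standard xml.etree.ElementTree module (neither is part of the slides) and reuses the second sample document; the XML declaration is left out because the document is passed as an already decoded string.

```python
import xml.etree.ElementTree as ET

# The second sample document from the slide, held as a Python string.
doc = """<文書 改訂日付="1999年3月1日">
 <題>サンプル</題>
 <段落>これはサンプル文書です。</段落>
 <!-- コメント -->
 <段落>会社名</段落>
 <図面 図面実体名="サンプル" />
</文書>"""

root = ET.fromstring(doc)

root_name = root.tag                   # element names are Unicode strings
revision = root.get("改訂日付")         # so are attribute names and values
paragraphs = [p.text for p in root.findall("段落")]

print(root_name, revision, paragraphs)
```

The comment in the document is skipped by the default parser, so only the two paragraph elements are returned.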


(7) XML and Unicode

  • XML is based on Unicode
    • XML is defined in terms of character structures [http://www.w3.org/TR/xml/#sec-starttags]
    • how these characters are encoded is not part of XML
  • How are XML documents encoded?
    • applications can use any character encoding they like
    • XML processors must support UTF-8 and UTF-16
    • XML processors may support any number of additional encodings
  • How is the encoding encoded?
    • part of the XML document: <?xml version="1.0" encoding="UTF-8"?>
    • bootstrap problem solved heuristically or by out-of-band information
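
The separation between characters and their encoding can be illustrated with a minimal sketch, assuming Python's standard xml.etree.ElementTree module (an assumption, not something the slides prescribe). The same document is serialized once as UTF-8 and once as UTF-16; the bytes differ, but the parser reports the same tree.

```python
import xml.etree.ElementTree as ET

body = "<題>サンプル</題>"

# Same characters, two different byte encodings, each announced in the
# XML declaration. Python's "utf-16" codec also writes a byte order mark.
utf8_bytes = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("utf-8")
utf16_bytes = ('<?xml version="1.0" encoding="UTF-16"?>' + body).encode("utf-16")

root8 = ET.fromstring(utf8_bytes)
root16 = ET.fromstring(utf16_bytes)

different_bytes = utf8_bytes != utf16_bytes
same_tree = (root8.tag, root8.text) == (root16.tag, root16.text)

print(different_bytes, same_tree)
```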


Uniform Resource Identifier (URI)

(9) Identifiers are Essential

  • Uniform Resource Locator (URL) is the old concept
    • introduced to distinguish between locating and naming
    • locating and naming are two ways of identification
    • URLs have been replaced by URIs; technically, URLs do not exist anymore
  • URIs identify resources
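
As a rough illustration that locators and names share one identifier syntax, the following sketch splits two URIs with Python's standard urllib.parse module. The choice of Python and the example URN are assumptions made for illustration; only the W3C address appears in the slides.

```python
from urllib.parse import urlsplit

# A URI that identifies by location (taken from the slides).
locator = urlsplit("http://www.w3.org/TR/xml/#sec-starttags")

# A URI that identifies by name (illustrative example, not from the slides).
name = urlsplit("urn:oasis:names:specification:docbook:dtd:xml:4.1.2")

# Both have a scheme; only the locator carries a network location.
print(locator.scheme, locator.netloc, locator.path, locator.fragment)
print(name.scheme, repr(name.netloc), name.path)
```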


(10) URIs and REST

  • Representational State Transfer (REST) requires identification
    1. identify all relevant resources
    2. design/use representations for those resources
    3. resources should be linked (via URI) for allowing navigation
  • URIs are the API of RESTful applications
    • URI-identified resources have a uniform interface (HTTP)
    • interaction with those resources is done via HTTP
    • clients can navigate the resource/state space by following links


XML

(12) XML Use Cases



XML Documents

(14) Markup?

  • Structures are encoded using special characters
    • a fundamental difference when comparing to binary formats
    • markup languages can be read and modified using text-based tools
    • programs must treat markup characters in a special way
  • Documents are content interspersed with markup (i.e., structures)
    • XML-aware software interprets the markup
    • XML-unaware software just sees a text file
    • modifications must be made XML-aware (e.g., inserting AT&T as AT&amp;T)
  • You have to pay the price for markup (see The Price for Markup, slide 20)
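
The AT&T case can be sketched as follows, assuming Python's standard library (xml.sax.saxutils.escape and xml.etree.ElementTree); the element name company is made up for the example. Text pasted in without regard to markup breaks the document, while the escaped version parses and yields the original text again.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

company = "AT&T"

# XML-unaware modification: the text is pasted in as-is.
naive = "<company>" + company + "</company>"
# XML-aware modification: markup characters are escaped first.
aware = "<company>" + escape(company) + "</company>"

try:
    ET.fromstring(naive)
    naive_ok = True
except ET.ParseError:
    naive_ok = False

parsed_text = ET.fromstring(aware).text

print(naive_ok, aware, parsed_text)
```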


(15) Basic Concepts

  • XML documents may begin with an XML declaration (optional)
  • There is exactly one document element (a.k.a. root element)
  • Elements may be nested (there is no conceptual limit)
    • elements may be repeated (they can be identified by position)
  • Elements are marked up using tags
    • most elements have content, surrounded by start and end tags
    • empty elements are allowed and may use a special notation
  • Elements may have attributes (zero to any number)
    • an attribute name can occur only once on an element (i.e., attributes cannot be repeated)
<?xml version="1.0" encoding="UTF-8"?>
<element>
 <subelement attribute="value">Content</subelement>
 <subelement a2="value2">More Content</subelement>
 <empty-element a3="v3"></empty-element>
 <empty-element a4="v4" a5="v5"/>
</element>
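
The concepts above can be observed by parsing the sample document. This is a minimal sketch that assumes Python's standard xml.etree.ElementTree module; it shows repeated elements addressed by position, the two notations for empty elements, and the rejection of a repeated attribute.

```python
import xml.etree.ElementTree as ET

doc = """<element>
 <subelement attribute="value">Content</subelement>
 <subelement a2="value2">More Content</subelement>
 <empty-element a3="v3"></empty-element>
 <empty-element a4="v4" a5="v5"/>
</element>"""

root = ET.fromstring(doc)                      # exactly one document element
children = [child.tag for child in root]       # nested, repeated elements
second = root.findall("subelement")[1]         # identified by position
empties = root.findall("empty-element")        # both notations parse alike

# An attribute name may appear only once per element.
try:
    ET.fromstring('<element a="1" a="2"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False

print(children, second.text, second.attrib, duplicate_accepted)
```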


(16) Tree Syntax

  • Markup is important, but only a notation
  • XML documents are trees with different node types
    • nodes so far: document, element, attribute, text
    [Figure: XML document tree]


(17) Elements

  • Elements can use a wide variety of names [http://www.w3.org/TR/xml/#NT-Name]
    • Allowed: html, id9832798472, _, :, こんにちは
    • Disallowed: leading digits, spaces, control characters
  • Element names usually convey some information about the content
    • this is not reliable and highly language-dependent
    • it is very useful when working with a known vocabulary
    • it is potentially harmful when working with an unknown vocabulary
  • Elements are the foundation for XML's versatility
    • they can be nested (<address><city>Berkeley</city><zip>94709</zip>…)
    • they can be repeated (<givenname>Erik</givenname><givenname>Thomas</givenname>)
    • their sequence can convey additional information (given names have a sequence)
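
One way to try out the naming rules is to let a parser decide. The helper below is a hypothetical function written for this sketch, assuming Python's standard xml.etree.ElementTree module. The colon from the list of allowed names is not tested here because this parser is namespace-aware and treats the colon as a prefix separator.

```python
import xml.etree.ElementTree as ET

def is_usable_element_name(name):
    """Return True if <name/> parses and the parser reports that same name."""
    try:
        root = ET.fromstring("<" + name + "/>")
    except ET.ParseError:
        return False
    return root.tag == name

allowed = ["html", "id9832798472", "_", "こんにちは"]
disallowed = ["9lives", "my element"]   # leading digit, embedded space

results = {name: is_usable_element_name(name) for name in allowed + disallowed}
print(results)
```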


(18) Attributes

  • Additional information pertaining to elements
  • Traditionally, anything that is not considered content
    • SGML is a document markup language
    • XML uses SGML's concepts
    • XML has its roots in the document world
  • Elements: Content (i.e., Data); Attributes: Metadata
  • Document vocabularies often draw the line at textual content: text is element content, everything else goes into attributes
 <section id="xml" author="bob">
  <title>Extensible Markup Language (XML)</title>
  <p>XML is based on SGML (Section <ref name="sgml"/>) ...</p>
  <p type="example">XML can be used ...</p>
  <section id="xml-syntax" author="dret">
   <title>XML Syntax</title>
   <p>Section <ref name="sgml-syntax"/> describes ...</p>
  </section>
 </section>
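
The split between content and metadata in the example can be made visible with a short sketch, assuming Python's standard xml.etree.ElementTree module. Attribute values are read separately from the text, and they do not show up when the textual content of an element is collected.

```python
import xml.etree.ElementTree as ET

doc = """<section id="xml" author="bob">
  <title>Extensible Markup Language (XML)</title>
  <p>XML is based on SGML (Section <ref name="sgml"/>) ...</p>
  <p type="example">XML can be used ...</p>
  <section id="xml-syntax" author="dret">
   <title>XML Syntax</title>
   <p>Section <ref name="sgml-syntax"/> describes ...</p>
  </section>
 </section>"""

root = ET.fromstring(doc)

# Metadata: attributes of every section, including the outer one.
authors = {s.get("id"): s.get("author") for s in root.iter("section")}

# Content: the text of elements.
titles = [t.text for t in root.iter("title")]
first_paragraph = "".join(root.find("p").itertext())

print(authors, titles, first_paragraph)
```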


(19) Attribute Syntax

  • Naming rules are the same as for elements (see Elements, slide 17)
  • Attributes always appear within an element's start tag
  • Attributes are name/value pairs [http://www.w3.org/TR/xml/#NT-Attribute]
    • the value is enclosed in single or double quotes
  • Attribute value containing a single quote: <elem attr="Single: '"/>
  • Attribute value containing a double quote: <elem attr='Double: "'/>
  • How can attribute values contain both?


(20) The Price for Markup

  • Markup characters have a special meaning
    • < opens a tag
    • for attribute values, quotes delimit the value
  • The literal use of a markup character requires escaping
    • XML's entities can refer to pieces of content
    • an entity reference is written &name;, where name is the name of the entity
    • XML has 5 predefined entities [http://www.w3.org/TR/xml/#sec-predefined-ent]: &lt;, &gt;, &amp;, &apos;, &quot;
  • Attribute using both kinds of quotes: <elem attr="Single ' and Double &quot;"/>
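
Quoting attribute values by hand is easy to get wrong, so libraries usually offer a helper. The sketch below assumes Python's standard xml.sax.saxutils.quoteattr function, which picks a quote style and escapes where needed; parsing the result returns the original value.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

value = "Single ' and Double \""

# quoteattr returns the value including its surrounding quotes.
quoted = quoteattr(value)
elem = ET.fromstring("<elem attr=" + quoted + "/>")
round_trip = elem.get("attr")

# Values containing only one kind of quote need no escaping at all.
only_single = quoteattr("Single: '")
only_double = quoteattr('Double: "')

print(quoted, only_single, only_double)
```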


(21) Mixed Content

The term "mixed content" in XML refers to elements which have text content mixed with elements [http://www.w3.org/TR/xml/#sec-mixed-content]. What these elements do depends on the elements [image: smiley.gif], but the important point is that they are on the same level as the text nodes of the mixed content.

<p>The term <em>Mixed content</em> in XML refers to elements <a href="http://www.w3.org/TR/xml/#sec-mixed-content">which have text content mixed with elements</a>. What these elements do depends on the elements <img style="height : 1em" src="smiley.gif"/>, but the important point is that they are on the same level as the text nodes of the mixed content.</p>
[Figure: XML tree for mixed content]
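
That text nodes and elements sit on the same level can be seen by listing the children of a mixed content element. This sketch assumes Python's standard xml.dom.minidom module and uses only the first sentence of the paragraph above.

```python
from xml.dom import minidom

doc = ('<p>The term <em>Mixed content</em> in XML refers to elements '
       '<a href="http://www.w3.org/TR/xml/#sec-mixed-content">which have '
       'text content mixed with elements</a>.</p>')

p = minidom.parseString(doc).documentElement

# Children of <p>, in document order: text nodes and elements are siblings.
kinds = ["text" if n.nodeType == n.TEXT_NODE else n.tagName
         for n in p.childNodes]
first_text = p.childNodes[0].data
last_text = p.childNodes[-1].data

print(kinds, repr(first_text), repr(last_text))
```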

(22) Mixed Content Usage

  • Database people find mixed content irritating
    • cannot be easily mapped to relational structures
    • is more document-like than data-like
    • much harder to optimize for query analysis and query processing
  • Document people find mixed content very intriguing
    • textual content can still be used as simple text
    • markup provides additional information for rich text
    • start with a text-only document and use markup to add structure to it


(23) Whitespace

  • XML documents often are pretty-printed
  • Whitespace text nodes often are not really content
    • XML whitespace characters are space, tab, newline, and carriage return
    • whitespace text nodes are text nodes containing only whitespace characters
    [Figure: XML tree with whitespace text nodes]
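
Whitespace text nodes introduced by pretty-printing can be counted directly. The sketch assumes Python's standard xml.dom.minidom module and a shortened version of the earlier sample document.

```python
from xml.dom import minidom

pretty = """<element>
 <subelement>Content</subelement>
 <subelement>More Content</subelement>
</element>"""

root = minidom.parseString(pretty).documentElement
children = root.childNodes

# Text nodes that contain nothing but space, tab, newline, carriage return.
whitespace_nodes = [n for n in children
                    if n.nodeType == n.TEXT_NODE
                    and n.data.strip(" \t\n\r") == ""]
element_nodes = [n for n in children if n.nodeType == n.ELEMENT_NODE]

print(len(children), len(whitespace_nodes), len(element_nodes))
```

Two elements were written, but the document element has five children, three of which exist only because of the line breaks and indentation.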


(24) Significant Whitespace

  • Some whitespace text nodes are relevant
  • Usually text nodes in mixed content elements

Whitespace can be very important!

<p>Whitespace <i>can be</i> <u>very</u> <b>important</b>!</p>
[Figure: XML tree containing significant whitespace]
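
What goes wrong when whitespace-only text nodes are discarded in mixed content can be shown in a few lines. This is a sketch assuming Python's standard xml.etree.ElementTree module; the "drop" strategy is a deliberately naive one chosen for illustration.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring(
    "<p>Whitespace <i>can be</i> <u>very</u> <b>important</b>!</p>")

# Keep every text node.
kept = "".join(p.itertext())

# Naively drop text nodes that contain only whitespace.
dropped = "".join(t for t in p.itertext() if t.strip())

print(kept)
print(dropped)
```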

Processing XML

(26) Observing XML Syntax

  • XML's syntax requires you to use the right characters
  • XML processors (a.k.a. XML parsers) check for these rules
    • if there are problems, the document cannot be interpreted as XML
    • otherwise, the document is said to be well-formed
  • Only well-formed documents can be regarded as a tree
    • other documents are not XML at all, even though they may be close
    • XML processors must report problems to the application (no silent recovery)
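
A well-formedness check amounts to running a parser and seeing whether it reports an error. The helper below is a hypothetical function for this sketch, assuming Python's standard xml.etree.ElementTree module, whose parser raises ParseError instead of silently repairing the input.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as an XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

checks = {
    "<a><b>ok</b></a>": is_well_formed("<a><b>ok</b></a>"),
    "overlapping tags": is_well_formed("<a><b>text</a></b>"),
    "two root elements": is_well_formed("<a/><b/>"),
    "unquoted attribute": is_well_formed("<a attr=value/>"),
}
print(checks)
```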


(27) Validity

  • Well-formed documents observe XML rules
    • they observe the XML syntax
    • they observe all well-formedness constraints
  • Applications require the right elements and attributes
  • Validity is a more comprehensive concept
  • Valid documents observe additional rules, such as the constraints of a Document Type Definition (DTD)
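
The difference between being well-formed and meeting an application's expectations can be sketched with a hand-written check. This is not DTD validation: Python's standard xml.etree.ElementTree module (assumed here) does not validate, so the rule "an address contains a city followed by a zip" is a hypothetical stand-in for what a schema would express.

```python
import xml.etree.ElementTree as ET

# Hypothetical application rule, standing in for a real schema.
REQUIRED_ROOT = "address"
REQUIRED_CHILDREN = ["city", "zip"]

def check(text):
    """Classify a document as seen by this particular application."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return "not well-formed"
    if root.tag != REQUIRED_ROOT or [c.tag for c in root] != REQUIRED_CHILDREN:
        return "well-formed, but not acceptable"
    return "acceptable"

good = check("<address><city>Berkeley</city><zip>94709</zip></address>")
wrong_elements = check("<address><town>Berkeley</town></address>")
broken = check("<address><city>Berkeley</address>")

print(good, wrong_elements, broken, sep="\n")
```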


(28) Semantics

  • XML is a language for encoding trees
    • elements and attributes are labeled nodes in this tree
    • the labels can be chosen freely by document authors
  • XML is not concerned with the tree's meaning
    • peers must have a mutual understanding of the semantics
    • XML without mutual understanding is almost useless
    • reverse engineering often is possible, but it is risky and brittle


Conclusions

(30) XML Documents


