XML Basics

XML Foundations (INFO 242)

Erik Wilde, UC Berkeley School of Information
2007-09-04
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Abstract

The Extensible Markup Language (XML) defines a simple way of structuring data. The power and popularity of XML can be explained by its versatility, its platform independence, the standards and technologies that build on it, and the number of tools and products that support it. Understanding XML itself is rather simple, because it depends on only a very small set of other technologies, of which Unicode and URIs are the most important. XML itself specifies two different things: a format for structured data, whose instances are called XML documents, and a constraint language for XML documents, called Document Type Definition (DTD).

Outline (Foundations for XML)

  1. Foundations for XML
    1. Unicode
    2. Uniform Resource Identifier (URI)
  2. XML
    1. XML Documents
    2. Processing XML
  3. Conclusions

Identifications

Outline (Unicode)

XML's Idea of Content and Names

XML documents can use a wide array of characters, in content as well as in names. These characters are defined by Unicode, which as of Version 5.0 defines more than 100,000 characters (the 100,000th was added in 2005).

<?xml version="1.0" encoding="UTF-8"?>
<JAPANESE>
 <TITLE>専門家リスト </TITLE>
 <ITEM>アシム・アブドゥラー氏(コマースネット事務局長)</ITEM>
 <ITEM>アラン・A・メッコラー氏(メッコラーメディア会長兼CEO)</ITEM>
 <ITEM>アラン・サルディッチ氏(メトリコムディレクター)</ITEM>
 <ITEM>ウィスター・ウォルコット氏(パイロットネットワーク・サービシズ副社長)</ITEM>
 <ITEM>・エリック・リンゲワルド氏(ビー・インク副社長)</ITEM>
 <ITEM>ジェームス・L・バークスデール氏(ネットスケープ・コミュニケーションズ社長)</ITEM>
</JAPANESE>

<?xml version="1.0" encoding="UTF-8"?>
<文書 改訂日付="1999年3月1日">
 <題>サンプル</題>
 <段落>これはサンプル文書です。</段落>
 <!-- コメント -->
 <段落>会社名</段落>
 <図面 図面実体名="サンプル" />
</文書>
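
To see that non-ASCII names and content survive parsing unchanged, the second sample document can be fed to an XML parser. The following is a minimal sketch that assumes Python and its standard `xml.etree.ElementTree` module; the slides do not prescribe any particular tool, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# The second sample document from above, as a Python (Unicode) string.
doc = """<文書 改訂日付="1999年3月1日">
 <題>サンプル</題>
 <段落>これはサンプル文書です。</段落>
 <!-- コメント -->
 <段落>会社名</段落>
 <図面 図面実体名="サンプル" />
</文書>"""

# On disk or on the wire, the document is a sequence of bytes in some
# encoding; the XML declaration names that encoding (here UTF-8).
body_bytes = doc.encode("utf-8")
data = b'<?xml version="1.0" encoding="UTF-8"?>\n' + body_bytes

root = ET.fromstring(data)

# Element and attribute names are ordinary Unicode strings after parsing.
print(root.tag, root.get("改訂日付"))
print([child.tag for child in root])  # the default parser skips the comment

# Characters and bytes are different things: each Japanese character
# takes several bytes in UTF-8.
print(len(doc), "characters,", len(body_bytes), "bytes")
```

The parser decodes the bytes according to the declared encoding, so the application only ever sees Unicode characters, regardless of how the document was stored.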

XML and Unicode

Outline (Uniform Resource Identifier (URI))

Identifiers are Essential

Outline (XML)

XML Use Cases

Outline (XML Documents)

Markup?

Basic Concepts

<?xml version="1.0" encoding="UTF-8"?>
<element>
 <subelement attribute="value">Content</subelement>
 <subelement a2="value2">More Content</subelement>
 <empty-element a3="v3"></empty-element>
 <empty-element a4="v4" a5="v5"/>
</element>
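
The document above can be loaded into a tree to make the basic concepts (elements, attributes, content, empty elements) visible. This is a minimal sketch assuming Python's standard `xml.etree.ElementTree` module, which is a choice made for this example rather than something the slides require.

```python
import xml.etree.ElementTree as ET

doc = b"""<?xml version="1.0" encoding="UTF-8"?>
<element>
 <subelement attribute="value">Content</subelement>
 <subelement a2="value2">More Content</subelement>
 <empty-element a3="v3"></empty-element>
 <empty-element a4="v4" a5="v5"/>
</element>"""

root = ET.fromstring(doc)

# Walk the children of the document element: name, attributes, text content.
for child in root:
    print(child.tag, child.attrib, repr(child.text))

# Both notations for an empty element produce the same kind of node:
# an element with no text and no child elements.
empties = root.findall("empty-element")
print([(e.text, len(e)) for e in empties])
```

After parsing, an application can no longer tell whether an empty element was written as a start tag followed by an end tag or as a single empty-element tag; the two forms differ only in syntax.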

Tree Syntax

Elements

Attributes

 <section id="xml" author="bob">
  <title>Extensible Markup Language (XML)</title>
  <p>XML is based on SGML (Section <ref name="sgml"/>) ...</p>
  <p type="example">XML can be used ...</p>
  <section id="xml-syntax" author="dret">
   <title>XML Syntax</title>
   <p>Section <ref name="sgml-syntax"/> describes ...</p>
  </section>
 </section>

Attribute Syntax

The Price for Markup

Attribute using both kinds of quotes: <elem attr="Single ' and Double &quot;"/>
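
The quoting rule can be checked with a parser: whichever quote character delimits the attribute value has to be escaped inside it, while the other one may appear literally. The sketch below assumes Python's standard library (`xml.etree.ElementTree` and `xml.sax.saxutils.quoteattr`); the second spelling of the element is an illustrative variant that does not appear in the slides.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Double quotes delimit the value, so the inner double quote is escaped.
double_delimited = ET.fromstring(
    """<elem attr="Single ' and Double &quot;"/>"""
)

# Illustrative variant: single quotes delimit the value, so the inner
# single quote is escaped instead.
single_delimited = ET.fromstring(
    """<elem attr='Single &apos; and Double "'/>"""
)

value = double_delimited.get("attr")
print(value)

# Both spellings carry exactly the same attribute value.
print(value == single_delimited.get("attr"))

# When writing XML, a library helper can pick delimiters and escapes.
print(quoteattr(value))
```

The escaping is part of the price of markup: it exists only in the serialized document, and the parsed value contains plain quote characters.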

Mixed Content

The term Mixed content in XML refers to elements which have text content mixed with elements. What these elements do depends on the elements [image: smiley.gif], but the important point is that they are on the same level as the text nodes of the mixed content.

<p>The term <em>Mixed content</em> in XML refers to elements <a href="http://www.w3.org/TR/xml/#sec-mixed-content">which have text content mixed with elements</a>. What these elements do depends on the elements <img style="height : 1em" src="smiley.gif"/>, but the important point is that they are on the same level as the text nodes of the mixed content.</p>
XML tree for mixed content
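
The alternation of text nodes and elements in the paragraph above can be listed programmatically. This is a sketch assuming Python's `xml.etree.ElementTree`, which stores the text before the first child in `text` and the text following each child in that child's `tail`; the helper `children_in_order` is a hypothetical name introduced only for this example.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring(
    '<p>The term <em>Mixed content</em> in XML refers to elements '
    '<a href="http://www.w3.org/TR/xml/#sec-mixed-content">which have text '
    'content mixed with elements</a>. What these elements do depends on the '
    'elements <img style="height : 1em" src="smiley.gif"/>, but the important '
    'point is that they are on the same level as the text nodes of the mixed '
    'content.</p>'
)


def children_in_order(element):
    """Yield the text and element children of `element` in document order."""
    if element.text:
        yield ("text", element.text)
    for child in element:
        yield ("element", child.tag)
        if child.tail:
            # Text that follows a child element is a sibling of that child.
            yield ("text", child.tail)


nodes = list(children_in_order(p))
for kind, value in nodes:
    print(kind, repr(value))
```

The listing shows text and elements as siblings under the same parent, which is exactly what distinguishes mixed content from element-only content.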

Mixed Content Usage

Whitespace

Significant Whitespace

Whitespace can be very important!

<p>Whitespace <i>can be</i> <u>very</u> <b>important</b>!</p>
XML tree containing significant whitespace
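
Why this whitespace matters becomes clear when a program discards text nodes that consist only of whitespace, a common shortcut when cleaning up indented XML. The sketch below assumes Python's `xml.etree.ElementTree`; the function `drop_whitespace_only_nodes` is a hypothetical helper written for this illustration.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring(
    "<p>Whitespace <i>can be</i> <u>very</u> <b>important</b>!</p>"
)

# The text nodes in document order, including the single spaces that sit
# between the inline elements.
pieces = list(p.itertext())
print(pieces)

# Concatenating all text nodes reproduces the sentence as rendered.
full_text = "".join(pieces)
print(full_text)


def drop_whitespace_only_nodes(element):
    """Join the text nodes, skipping those that contain only whitespace."""
    return "".join(t for t in element.itertext() if t.strip())


# The spaces between </i> and <u>, and between </u> and <b>, are lost,
# so the words run together.
damaged_text = drop_whitespace_only_nodes(p)
print(damaged_text)
```

In mixed content, a whitespace-only text node can separate words, so it cannot be removed safely without knowing how the surrounding elements are used.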

Outline (Processing XML)

Observing XML Syntax

Validity

Semantics

Outline (Conclusions)

XML Documents