XML Basics

XML Foundations (INFOSYS 242)

Erik Wilde, UC Berkeley iSchool
Thursday, August 31, 2006
Creative Commons License

This work is licensed under a Creative Commons
Attribution-NonCommercial-ShareAlike 2.5 License.

Abstract

The Extensible Markup Language (XML) defines a simple way of structuring data. The power and popularity of XML can be explained by its versatility, its platform independence, the standards and technologies that build on it, and the number of tools and products supporting it. Understanding XML itself is rather simple, because it depends on only a very small set of other technologies, of which Unicode and URIs are the most important. XML itself specifies two different things: a format for structured data, called XML documents, and a constraint language for XML documents, called Document Type Definition (DTD).

Reminders

Outline (Foundations for XML)

  1. Foundations for XML [4]
    1. Unicode [2]
    2. Uniform Resource Identifier (URI) [1]
  2. XML [15]
    1. XML Documents [11]
    2. Processing XML [3]
  3. Conclusions [2]

Identifications

Outline (Unicode)

XML's Idea of Content and Names

XML documents can use a wide array of characters. These characters are defined by Unicode, which currently (Version 5.0) defines more than 100,000 characters (the 100,000th was added in 2005).

<?xml version="1.0" encoding="UTF-8"?>
<JAPANESE>
 <TITLE>専門家リスト </TITLE>
 <ITEM>アシム・アブドゥラー氏(コマースネット事務局長)</ITEM>
 <ITEM>アラン・A・メッコラー氏(メッコラーメディア会長兼CEO)</ITEM>
 <ITEM>アラン・サルディッチ氏(メトリコムディレクター)</ITEM>
 <ITEM>ウィスター・ウォルコット氏(パイロットネットワーク・サービシズ副社長)</ITEM>
 <ITEM>・エリック・リンゲワルド氏(ビー・インク副社長)</ITEM>
 <ITEM>ジェームス・L・バークスデール氏(ネットスケープ・コミュニケーションズ社長)</ITEM>
</JAPANESE>

<?xml version="1.0" encoding="UTF-8"?>
<文書 改訂日付="1999年3月1日">
 <題>サンプル</題>
 <段落>これはサンプル文書です。</段落>
 <!-- コメント -->
 <段落>会社名</段落>
 <図面 図面実体名="サンプル" />
</文書>
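
To make the point about Unicode in both content and names concrete, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. It assumes a trimmed variant of the second document above (the comment and the last two elements are left out for brevity) and simply checks that non-ASCII element names, attribute names, and text survive a round trip through UTF-8 bytes.

```python
import xml.etree.ElementTree as ET

# Trimmed variant of the second sample document (an assumption made
# for brevity; the structure is otherwise unchanged).
doc = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<文書 改訂日付="1999年3月1日">\n'
    ' <題>サンプル</題>\n'
    ' <段落>これはサンプル文書です。</段落>\n'
    '</文書>\n'
)

# Encode to bytes, as the document would be stored or transmitted.
data = doc.encode("utf-8")

root = ET.fromstring(data)
title = root.find("題")

print(root.tag)                  # element name in Japanese
print(root.attrib["改訂日付"])    # attribute looked up by a Japanese name
print(title.text)                # text content
print(len(doc), len(data))       # characters vs. UTF-8 bytes
```

The last line shows that the number of characters and the number of bytes differ: in UTF-8, each of the Japanese characters used here takes more than one byte.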

XML and Unicode

Outline (Uniform Resource Identifier (URI))

Identifiers are Essential

Outline (XML)

XML Use Cases

Outline (XML Documents)

Markup?

Basic Concepts

<?xml version="1.0" encoding="UTF-8"?>
<element>
 <subelement attribute="value">Content</subelement>
 <subelement a2="value2">More Content</subelement>
 <empty-element a3="v3"></empty-element>
 <empty-element a4="v4" a5="v5"/>
</element>
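
The document above can be read as a tree of elements, each with a name, optional attributes, and optional content. The following Python sketch (standard library only; the helper name `describe` is made up for this illustration) walks that tree and also checks that the two ways of writing an empty element give the same result after parsing.

```python
import xml.etree.ElementTree as ET

doc = b"""<?xml version="1.0" encoding="UTF-8"?>
<element>
 <subelement attribute="value">Content</subelement>
 <subelement a2="value2">More Content</subelement>
 <empty-element a3="v3"></empty-element>
 <empty-element a4="v4" a5="v5"/>
</element>"""

root = ET.fromstring(doc)

def describe(elem, depth=0):
    """Return one indented line per element: name, attributes, text."""
    text = (elem.text or "").strip()
    lines = [f"{'  ' * depth}{elem.tag} {elem.attrib} {text!r}"]
    for child in elem:
        lines.extend(describe(child, depth + 1))
    return lines

print("\n".join(describe(root)))

# Both notations for empty elements: no text and no children.
empties = root.findall("empty-element")
for e in empties:
    print(e.attrib, repr(e.text), len(e))
```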

Tree Syntax

Elements

Attributes

 <section id="xml" author="bob">
  <title>Extensible Markup Language (XML)</title>
  <p>XML is based on SGML (Section <ref name="sgml"/>) ...</p>
  <p type="example">XML can be used ...</p>
  <section id="xml-syntax" author="dret">
   <title>XML Syntax</title>
   <p>Section <ref name="sgml-syntax"/> describes ...</p>
  </section>
 </section>
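
As a rough illustration of how an application might use the attributes in this example, the Python sketch below collects the `id` and `author` of every section, the `name` of every reference, and the paragraphs marked with `type="example"`. The variable names are chosen for this illustration only.

```python
import xml.etree.ElementTree as ET

doc = """<section id="xml" author="bob">
  <title>Extensible Markup Language (XML)</title>
  <p>XML is based on SGML (Section <ref name="sgml"/>) ...</p>
  <p type="example">XML can be used ...</p>
  <section id="xml-syntax" author="dret">
   <title>XML Syntax</title>
   <p>Section <ref name="sgml-syntax"/> describes ...</p>
  </section>
 </section>"""

root = ET.fromstring(doc)

# iter() visits the root element and all descendants in document order.
sections = {s.get("id"): s.get("author") for s in root.iter("section")}
refs = [r.get("name") for r in root.iter("ref")]

# Select elements by attribute value.
examples = root.findall(".//p[@type='example']")

# get() returns None when an element does not carry the attribute.
first_p = root.find("p")

print(sections)
print(refs)
print(len(examples), first_p.get("type"))
```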

Attribute Syntax

The Price for Markup

<li>Attribute using both kinds of quotes: <code>&lt;elem attr="Single ' and Double &amp;quot;"/></code></li>
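
The escaping shown in this line can be reproduced with a short Python sketch. It uses `quoteattr` and `escape` from the standard library's `xml.sax.saxutils`; the variable names are illustrative. The first step quotes an attribute value that contains both kinds of quotes, the second step escapes the resulting markup again so that it can be shown as text inside another document.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Attribute value containing both a single and a double quote.
value = "Single ' and Double \""

# quoteattr() picks a delimiter and escapes whatever conflicts with it.
attr = quoteattr(value)
markup = f"<elem attr={attr}/>"
print(markup)

# Parsing the markup gives back the original value.
parsed = ET.fromstring(markup).get("attr")
print(parsed == value)

# To display the markup itself as text, it has to be escaped once more.
shown = escape(markup)
print(shown)
```

Each level of embedding adds another layer of escaping, which is the price the slide title refers to.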

Mixed Content

The term Mixed content in XML refers to elements which have text content mixed with elements. What these elements do depends on the elements, but the important point is that they are on the same level as the text nodes of the mixed content.

<p>The term <em>Mixed content</em> in XML refers to elements <a href="http://www.w3.org/TR/xml/#sec-mixed-content">which have text content mixed with elements</a>. What these elements do depends on the elements <img style="height : 1em" src="smily.gif"/>, but the important point is that they are on the same level as the text nodes of the mixed content.</p>
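
The statement that child elements sit on the same level as the text can be observed with a parser. The sketch below uses Python's `xml.etree.ElementTree`, which is only one possible data model: it stores the text before the first child in `.text` and the text following each child in that child's `.tail`.

```python
import xml.etree.ElementTree as ET

doc = ('<p>The term <em>Mixed content</em> in XML refers to elements '
       '<a href="http://www.w3.org/TR/xml/#sec-mixed-content">which have text '
       'content mixed with elements</a>. What these elements do depends on the '
       'elements <img style="height : 1em" src="smily.gif"/>, but the important '
       'point is that they are on the same level as the text nodes of the mixed '
       'content.</p>')

p = ET.fromstring(doc)

# Text before the first child element.
print(repr(p.text))

# Each child with its own text and the text that follows it.
children = [(c.tag, c.text, c.tail) for c in p]
for tag, text, tail in children:
    print(tag, repr(text), repr(tail))

# All text in document order, with the markup dropped.
plain = "".join(p.itertext())
print(plain)
```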

Mixed Content Usage

Whitespace

Significant Whitespace

Whitespace can be very important!

<p>Whitespace <i>can be</i> <u>very</u> <b>important</b>!</p>
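
A small Python sketch shows why the spaces between the inline elements matter. It simulates, as an assumption for illustration, a tool that discards whitespace-only text between elements, and compares the resulting text with the original.

```python
import xml.etree.ElementTree as ET

doc = "<p>Whitespace <i>can be</i> <u>very</u> <b>important</b>!</p>"
p = ET.fromstring(doc)

# Text as written, with the spaces between the inline elements kept.
with_spaces = "".join(p.itertext())
print(with_spaces)

# Simulate a tool that drops whitespace-only text between elements.
for child in p:
    if child.tail is not None and child.tail.strip() == "":
        child.tail = None

without_spaces = "".join(p.itertext())
print(without_spaces)
```

In mixed content, whitespace between elements is part of the text, so a processor cannot safely treat it as mere formatting.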

Outline (Processing XML)

Observing XML Syntax

Validity

Semantics

Outline (Conclusions)

XML Documents

XML Document Classes