XML Basics

Database Management [./]
Fall 2012 — INFO 257

Erik Wilde, EMC IIG
2012-09-25

Creative Commons License [http://creativecommons.org/licenses/by/3.0/]

This work is licensed under a CC
Attribution 3.0 Unported License
[http://creativecommons.org/licenses/by/3.0/]

E. Wilde: XML Basics

(2) Abstract

The Extensible Markup Language (XML) defines a simple way of structuring data. The power and popularity of XML can be explained by its versatility, its platform independence, the standards and technologies that build on it, and the number of tools and products that support it. Understanding XML itself is rather simple because it depends on only a very small set of other technologies, of which Unicode and URIs are the most important. XML itself specifies two different things: a format for structured data, called XML documents, and a constraint language for XML documents, called Document Type Definition (DTD).



Foundations for XML

Outline (Foundations for XML)

  1. Foundations for XML
    1. Unicode
    2. Uniform Resource Identifier (URI)
  2. XML
    1. XML Documents
    2. Processing XML
  3. Conclusions

(4) Identifications



Unicode

(6) XML's Idea of Content and Names

XML documents can use a wide array of characters. These characters are defined by Unicode [http://www.unicode.org/], which as of Version 5.0 defines more than 100'000 characters (character number 100'000 was added in 2005).

<?xml version="1.0" encoding="UTF-8"?>
<JAPANESE>
 <TITLE>専門家リスト </TITLE>
 <ITEM>アシム・アブドゥラー氏(コマースネット事務局長)</ITEM>
 <ITEM>アラン・A・メッコラー氏(メッコラーメディア会長兼CEO)</ITEM>
 <ITEM>アラン・サルディッチ氏(メトリコムディレクター)</ITEM>
 <ITEM>ウィスター・ウォルコット氏(パイロットネットワーク・サービシズ副社長)</ITEM>
 <ITEM>・エリック・リンゲワルド氏(ビー・インク副社長)</ITEM>
 <ITEM>ジェームス・L・バークスデール氏(ネットスケープ・コミュニケーションズ社長)</ITEM>
</JAPANESE>

<?xml version="1.0" encoding="UTF-8"?>
<文書 改訂日付="1999年3月1日">
 <題>サンプル</題>
 <段落>これはサンプル文書です。</段落>
 <!-- コメント -->
 <段落>会社名</段落>
 <図面 図面実体名="サンプル" />
</文書>


(7) XML and Unicode

  • XML is based on Unicode
    • XML is defined in terms of character structures [http://www.w3.org/TR/xml/#sec-starttags] (called markup)
    • how these characters are encoded is not part of XML
  • How are XML documents encoded?
    • applications can use any character encoding they like
    • XML processors must support UTF-8 and UTF-16
    • XML processors may support any number of additional encodings
  • How is the encoding encoded?
    • part of the XML document: <?xml version="1.0" encoding="UTF-8"?>
    • bootstrap problem solved heuristically or by out-of-band information
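As a minimal sketch of the points above, the following Python snippet serializes the same small document once in UTF-8 and once in UTF-16 and parses both with the standard library's xml.etree.ElementTree. The helper name document and the sample content are assumptions made only for this illustration. The byte sequences differ, but the parser picks up the encoding from the byte order mark and the XML declaration, so both inputs should produce the same element name and text.

```python
import xml.etree.ElementTree as ET

TEXT = "サンプル"  # sample content, chosen only for illustration


def document(encoding: str) -> bytes:
    """Return the same tiny XML document, serialized in the given encoding."""
    xml_text = f'<?xml version="1.0" encoding="{encoding}"?><題>{TEXT}</題>'
    # Python's "UTF-16" codec writes a byte order mark, which helps the
    # parser solve the bootstrap problem before it can read the declaration.
    return xml_text.encode(encoding)


utf8_bytes = document("UTF-8")
utf16_bytes = document("UTF-16")

# Different bytes on disk, same characters after parsing.
utf8_root = ET.fromstring(utf8_bytes)
utf16_root = ET.fromstring(utf16_bytes)

print(len(utf8_bytes), len(utf16_bytes))
print(utf8_root.tag == utf16_root.tag, utf8_root.text == utf16_root.text)
```

UTF-8 and UTF-16 are used here because they are the two encodings every XML processor must support; any other encoding would depend on the particular parser.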


Uniform Resource Identifier (URI)

(9) Identifiers are Essential

  • Uniform Resource Locator (URL) is the old concept
    • introduced to distinguish between locating and naming
    • locating and naming are two ways of identification
    • URLs have been replaced by URIs; technically, URLs do not exist anymore
  • URIs identify resources


XML

(11) XML Use Cases



XML Documents

(13) Markup?

  • Structures are encoded using special characters
    • a fundamental difference when comparing to binary formats
    • markup languages can be read and modified using text-based tools
    • programs must treat markup characters in a special way
  • Documents are content interspersed with markup (i.e., structures)
    • XML-aware software interprets the markup
    • XML-unaware software just sees a text file
    • modifications must be made XML-aware (e.g., inserting AT&T as AT&amp;T)
  • You have to pay the price for markup (see The Price for Markup)
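The AT&T case can be sketched in Python. This is only an illustration using the standard library's xml.etree.ElementTree; the element name company is an assumption. An XML-unaware edit (plain string concatenation) leaves a bare ampersand in the text, while an XML-aware edit lets the library escape it.

```python
import xml.etree.ElementTree as ET

company = "AT&T"

# XML-unaware modification: string concatenation leaves a bare "&".
naive = "<company>" + company + "</company>"
try:
    ET.fromstring(naive)
    naive_is_well_formed = True
except ET.ParseError:
    naive_is_well_formed = False

# XML-aware modification: the library escapes markup characters for us.
element = ET.Element("company")
element.text = company
aware = ET.tostring(element, encoding="unicode")

print(naive_is_well_formed)
print(aware)
```

Parsing the XML-aware result gives back the original string, so the escaping is invisible to applications that work on the tree.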


(14) Basic Concepts

  • XML documents may start with an XML declaration (it is optional)
  • There is exactly one document element (a.k.a. root element)
  • Elements may be nested (there is no conceptual limit)
    • elements may be repeated (they can be identified by position)
  • Elements are marked up using tags
    • most elements have content, surrounded by start and end tags
    • empty elements are allowed and may use a special notation
  • Elements may have attributes (zero to any number)
    • each attribute name can occur only once on an element (i.e., attributes cannot be repeated)
<?xml version="1.0" encoding="UTF-8"?>
<element>
 <subelement attribute="value">Content</subelement>
 <subelement a2="value2">More Content</subelement>
 <empty-element a3="v3"></empty-element>
 <empty-element a4="v4" a5="v5"/>
</element>
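The example document can be explored with a few lines of Python. This is a sketch using xml.etree.ElementTree from the standard library; the helper is_empty and the variable names are assumptions for this illustration. It shows the single document element, the repeated child elements in document order, the equivalence of the two empty-element notations, and the rejection of a repeated attribute.

```python
import xml.etree.ElementTree as ET

DOCUMENT = b"""<?xml version="1.0" encoding="UTF-8"?>
<element>
 <subelement attribute="value">Content</subelement>
 <subelement a2="value2">More Content</subelement>
 <empty-element a3="v3"></empty-element>
 <empty-element a4="v4" a5="v5"/>
</element>
"""

root = ET.fromstring(DOCUMENT)           # the one document element
child_tags = [child.tag for child in root]  # children in document order
first, second, empty_long, empty_short = list(root)


def is_empty(element):
    """True if the element has neither text nor child elements."""
    return element.text is None and len(element) == 0


# A repeated attribute name violates well-formedness.
try:
    ET.fromstring('<element a="1" a="2"/>')
    duplicate_accepted = True
except ET.ParseError:
    duplicate_accepted = False

print(root.tag, child_tags)
print(first.attrib, empty_short.attrib)
print(is_empty(empty_long), is_empty(empty_short), duplicate_accepted)
```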


(15) Tree Syntax

  • Markup is important, but only a notation
  • XML documents are trees with different node types
    • nodes so far: document, element, attribute, text
    XML document tree


(16) Elements

  • Elements can use a wide variety of names [http://www.w3.org/TR/xml/#NT-Name]
    • Allowed: html, id9832798472, _, :, こんにちは
    • Disallowed: leading digits, spaces, control characters
  • Element names usually convey some information about the content
    • this is not reliable and highly language-dependent
    • it is very useful when working with a known vocabulary
    • it is potentially harmful when working with an unknown vocabulary
  • Elements are the foundation for XML's versatility
    • they can be nested (<address><city>Berkeley</city><zip>94709</zip>…)
    • they can be repeated (<givenname>Erik</givenname><givenname>Thomas</givenname>)
    • their sequence can convey additional information (given names have a sequence)
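Nesting, repetition, and sequence can be observed with a short Python sketch based on xml.etree.ElementTree. The wrapping person element and the helper usable_as_element_name are assumptions for this example. The helper is a crude check that simply asks the parser whether an empty element with that name is accepted; the name ":" from the list above is left out because namespace-aware parsers treat the colon specially.

```python
import xml.etree.ElementTree as ET

# "person" is a hypothetical wrapper around the fragments from the slide.
PERSON = (
    "<person>"
    "<givenname>Erik</givenname><givenname>Thomas</givenname>"
    "<address><city>Berkeley</city><zip>94709</zip></address>"
    "</person>"
)

root = ET.fromstring(PERSON)

# Repeated elements come back in document order, so sequence is preserved.
given_names = [g.text for g in root.findall("givenname")]

# Nested elements are reached by following the path through the tree.
city = root.findtext("address/city")


def usable_as_element_name(name):
    """Crude check: does the parser accept <name/> as a document?"""
    try:
        ET.fromstring(f"<{name}/>")
        return True
    except ET.ParseError:
        return False


print(given_names, city)
print([usable_as_element_name(n) for n in ["html", "_", "こんにちは", "9lives"]])
```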


(17) Attributes

  • Additional information pertaining to elements
  • Traditionally, anything that is not considered content
    • SGML is a document markup language
    • XML uses SGML's concepts
    • XML has its roots in the document world
  • Elements: Content (i.e., Data); Attributes: Metadata
  • Document vocabularies often draw the line at what counts as textual content
 <section id="xml" author="bob">
  <title>Extensible Markup Language (XML)</title>
  <p>XML is based on SGML (Section <ref name="sgml"/>) ...</p>
  <p type="example">XML can be used ...</p>
  <section id="xml-syntax" author="dret">
   <title>XML Syntax</title>
   <p>Section <ref name="sgml-syntax"/> describes ...</p>
  </section>
 </section>


(18) Attribute Syntax

  • Naming rules are the same as for Elements [Elements (1)]
  • Attributes always appear within an element's start tag
  • Attributes are name/value-pairs [http://www.w3.org/TR/xml/#NT-Attribute]
    • the value is enclosed in single or double quotes
  • Attribute with a single-quote value: <elem attr="Single: '"/>
  • Attribute with a double-quote value: <elem attr='Double: "'/>
  • How can attribute values contain both?


(19) The Price for Markup

  • Markup characters have a special meaning
    • < opens a tag
    • for attribute values, quotes delimit the value
  • The literal use of a markup character requires escaping
    • XML's entities can refer to pieces of content
    • entity syntax is &name; for referring to the entity name
    • XML has 5 predefined entities [http://www.w3.org/TR/xml/#sec-predefined-ent]: &lt;, &gt;, &amp;, &apos;, &quot;
  • Attribute using both kinds of quotes: <elem attr="Single ' and Double &quot;"/>
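Escaping does not have to be done by hand. As a sketch, Python's standard library offers xml.sax.saxutils.escape for text content and xml.sax.saxutils.quoteattr for attribute values; the variable names below are assumptions for this example. For a value containing both kinds of quotes, quoteattr picks double quotes as delimiters and replaces the inner double quote with the predefined entity.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

value = "Single ' and Double \""

# quoteattr returns the value including its surrounding quotes.
xml_text = f"<elem attr={quoteattr(value)}/>"

# Parsing resolves the entity again, so the application sees the raw value.
round_trip = ET.fromstring(xml_text).get("attr")

# escape handles text content: "&", "<" and ">" become entity references.
escaped_text = escape("a < b & c")

print(xml_text)
print(round_trip == value)
print(escaped_text)
```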


(20) Mixed Content

The term Mixed content in XML refers to elements which have text content mixed with elements [http://www.w3.org/TR/xml/#sec-mixed-content]. What these elements do depends on the elements smiley.gif, but the important point is that they are on the same level as the text nodes of the mixed content.

<p>The term <em>Mixed content</em> in XML refers to elements <a href="http://www.w3.org/TR/xml/#sec-mixed-content">which have text content mixed with elements</a>. What these elements do depends on the elements <img style="height : 1em" src="smiley.gif"/>, but the important point is that they are on the same level as the text nodes of the mixed content.</p>
XML tree for mixed content
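The sibling relationship between text and elements can be made visible with a small Python sketch using xml.dom.minidom from the standard library. The shortened paragraph and the variable names are assumptions for this illustration. Listing the children of the p element shows text nodes and element nodes alternating on the same level.

```python
from xml.dom import minidom

# A shortened version of the paragraph above, used only for illustration.
MIXED = (
    '<p>The term <em>Mixed content</em> refers to '
    '<a href="http://www.w3.org/TR/xml/#sec-mixed-content">elements with text</a>.</p>'
)

paragraph = minidom.parseString(MIXED).documentElement

# Collect the direct children of <p>: text nodes and element nodes are siblings.
children = []
for node in paragraph.childNodes:
    if node.nodeType == node.TEXT_NODE:
        children.append(("text", node.data))
    elif node.nodeType == node.ELEMENT_NODE:
        children.append(("element", node.tagName))

for kind, value in children:
    print(kind, repr(value))
```

The text inside em and a does not appear in this list because it belongs to text nodes one level further down the tree.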

(21) Mixed Content Usage

  • Database people find mixed content irritating
    • cannot be easily mapped to relational structures
    • is more document-like than data-like
    • much harder to optimize for query analysis and query processing
  • Document people find mixed content very intriguing
    • textual content can still be used as simple text
    • markup provides additional information for rich text
    • start with a text-only document and use markup to add structure to it


Processing XML

(23) Observing XML Syntax

  • XML's syntax requires you to use the right characters
  • XML processors (a.k.a. XML parsers) check for these rules
    • if there are problems, the document cannot be interpreted as XML
    • otherwise, the document is said to be well-formed
  • Only well-formed documents can be regarded as a tree
    • other documents are not XML at all, even though they may be close
    • XML processors must report problems to the application (no silent recovery)
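A well-formedness check can be sketched in a few lines of Python. The function name is_well_formed is an assumption for this example, and xml.etree.ElementTree is only one of several parsers that could be used. The parser raises an error instead of silently repairing the input, which matches the reporting rule above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as an XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The parser reports the problem; it does not try to recover.
        return False


samples = {
    "<a><b></b></a>": "properly nested",
    "<a><b></a></b>": "overlapping tags",
    "<a>AT&T</a>": "unescaped ampersand",
    "<a/><b/>": "two document elements",
}

for text, description in samples.items():
    print(description, is_well_formed(text))
```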


(24) Validity

  • Well-formed documents observe XML rules
    • they observe the XML syntax
    • they observe all well-formedness constraints
  • Applications require the right elements and attributes
  • Validity is a more comprehensive concept
  • Valid documents observe additional rules
    • they must be well-formed documents
    • they must adhere to the constraints defined in a DTD


(25) Semantics

  • XML is a language for encoding trees
    • elements and attributes are labeled nodes in this tree
    • the labels can be chosen freely by document authors
  • XML is not concerned with the tree's meaning
    • peers must have a mutual understanding of the semantics
    • XML without mutual understanding is almost useless
    • reverse engineering often is possible, but it is risky and brittle


(26) Document Object Model (DOM)

  • DOM is a tree-based API for accessing XML documents
    • the specification uses a language-independent Interface Definition Language (IDL) [http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/idl-definitions.html]
    • language bindings map IDL to specific languages such as Java [http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/java-binding.html] or JavaScript [http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/ecma-script-binding.html]
  • DOM is based on an in-memory representation of an XML document
  • DOM parsers have an additional layer for building the tree
    • an underlying SAX parser reports structures for tree building
    • the memory representation is heavily interlinked (requiring substantial memory)
    • DOM calls query or modify the memory representation of the tree
  • DOM processing is not appropriate for all tasks
    • very large documents may not fit into memory (risk of thrashing)
    • for isolated tasks, the parsing overhead is prohibitive
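The query-and-modify style of DOM processing can be sketched with xml.dom.minidom, a lightweight DOM implementation in Python's standard library. The source document reuses element names from the earlier example; the variable names are assumptions for this illustration. The whole document is parsed into memory first, and all later calls operate on that in-memory tree.

```python
from xml.dom import minidom

SOURCE = (
    "<element>"
    '<subelement attribute="value">Content</subelement>'
    '<subelement a2="value2">More Content</subelement>'
    "</element>"
)

# Parsing builds the complete tree in memory.
document = minidom.parseString(SOURCE)

# Query: find elements by name and read the text node of the first one.
subelements = document.getElementsByTagName("subelement")
first_text = subelements[0].firstChild.data

# Modify: change an attribute and append a new, empty element.
subelements[1].setAttribute("a2", "changed")
extra = document.createElement("empty-element")
extra.setAttribute("a4", "v4")
document.documentElement.appendChild(extra)

# Serialize the modified tree back to markup.
serialized = document.documentElement.toxml()
print(first_text)
print(serialized)
```

Because every node is held in memory, this approach is convenient for small documents and becomes costly for very large ones, as noted above.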


(27) DOM Parser

DOM Parser

Conclusions

(29) XML Documents


