XML Varia

XML Foundations [./]
Fall 2013 — INFO 242 (CCN 41613)

Erik Wilde, UC Berkeley School of Information
2013-12-09

This work is licensed under a CC
Attribution 3.0 Unported License [http://creativecommons.org/licenses/by/3.0/]

Contents E. Wilde: XML Varia

Contents

(2) Abstract
1 Alternatives to XML
- (4) XML's Capabilities
- 1.1 JavaScript Object Notation (JSON)
  - (6) JSON is the new XML
  - (7) JSON Example
- 1.2 Resource Description Framework (RDF): Semantic Web & Linked Data
  - (9) Linked Data Principles
  - (10) URIs as Names
  - (11) RDF as Semantic Web Foundation
  - (12) Semantic Web Data Model (abstract)
  - (13) Semantic Web Data Model (URI concepts)
  - (14) Linking Data Sets
2 Various XML Technologies
- 2.1 XML Information Set (XML Infoset)
  - (17) What is the Content of an XML Document?
  - (18) Infoset Example
  - (19) What is Not in the Infoset
- 2.2 XML IDs with xml:id
  - (21) References in Instances
- 2.3 XML Inclusions (XInclude)
  - (23) Including Trees into Trees
  - (24) XInclude Schema
- 2.4 Processing XML
  - (26) XML and Programming
  - (27) XML and Programming Languages
  - (28) Typical XML & Programming Problem
  - 2.4.1 Simple API for XML (SAX)
    - (30) Lightweight XML Processing
    - (31) SAX Parser
  - 2.4.2 Document Object Model (DOM)
    - (33) XML Trees Everywhere
    - (34) DOM Parser
    - (35) JDOM
  - 2.4.3 XML Data Binding
    - (37) Mapping XML into Languages

(2) Abstract

The first half of the lecture compares XML to alternatives that are also used to represent, manage, exchange, and process data. The most relevant approaches in this space are RDF, JSON, and tabular/relational models such as SQL or NoSQL. One of the reasons for using XML-based approaches for data representation, management, interchange, and processing is the large landscape of existing standards, technologies, and tools: for many problems, it is possible to reuse existing solutions. In the second half of this lecture, we look at a small set of additional standards that were not yet covered.

Alternatives to XML

Outline (Alternatives to XML)

Alternatives to XML [9]
1. JavaScript Object Notation (JSON) [2]
2. Resource Description Framework (RDF) [6]
Various XML Technologies [15]
1. XML Information Set (XML Infoset) [3]
2. XML IDs with xml:id [1]
3. XML Inclusions (XInclude) [2]
4. Processing XML [9]
  1. Simple API for XML (SAX) [2]
  2. Document Object Model (DOM) [3]
  3. XML Data Binding [1]

(4) XML's Capabilities

If the only tool you have is a hammer, you tend to see every problem as a nail.

Abraham Maslow [http://en.wikipedia.org/wiki/Abraham_Maslow]

Representation (XML itself, XInclude)
Management (XDBMS, XQuery)
Interchange (sharing XML, REST)
Processing (XSLT, XQuery)

JavaScript Object Notation (JSON)

(6) JSON is the new XML

XML's tree model is not natural for developers
- developers tend to think in objects or similar concepts
- mapping bidirectionally creates unhappiness and friction
JSON takes a minimalist approach to representing object structures
- the only structuring concepts are objects (name/value pairs) [http://tools.ietf.org/html/rfc4627#section-2.2] and arrays [http://tools.ietf.org/html/rfc4627#section-2.3]; everything else is a simple value (string, number, true, false, or null)
- these concepts can be nested as deeply as required
JSON's main reason for success is structural alignment
- most developers can work with JSON directly in their language
- XML's more sophisticated features often are not required
JSON is a poster child of the Pareto Principle [http://en.wikipedia.org/wiki/Pareto_principle] (or maybe 95/5)

(7) JSON Example

<?xml version="1.0"?>
<menu id="file" value="File">
 <popup>
  <menuitem value="New" onclick="CreateNewDoc()"/>
  <menuitem value="Open" onclick="OpenDoc()"/>
  <menuitem value="Close" onclick="CloseDoc()"/>
 </popup>
</menu>

menu.xml

{ "menu" : {
 "id" : "file",
 "value" : "File",
 "popup" : {
  "menuitem" : [
   { "value" : "New", "onclick" : "CreateNewDoc()" },
   { "value" : "Open", "onclick" : "OpenDoc()" },
   { "value" : "Close", "onclick" : "CloseDoc()" }
  ]
 }
}}

menu.json
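
The two files above carry the same menu. As a rough illustration of the "structural alignment" argument, the following Python sketch reads both and extracts the onclick handlers. The function names are made up for this example, and the standard library modules json and xml.etree.ElementTree are just one possible choice of tools.

```python
import json
import xml.etree.ElementTree as ET

# The two documents from the slide, embedded as strings for self-containment.
MENU_XML = """<menu id="file" value="File">
 <popup>
  <menuitem value="New" onclick="CreateNewDoc()"/>
  <menuitem value="Open" onclick="OpenDoc()"/>
  <menuitem value="Close" onclick="CloseDoc()"/>
 </popup>
</menu>"""

MENU_JSON = """{ "menu" : {
 "id" : "file",
 "value" : "File",
 "popup" : {
  "menuitem" : [
   { "value" : "New", "onclick" : "CreateNewDoc()" },
   { "value" : "Open", "onclick" : "OpenDoc()" },
   { "value" : "Close", "onclick" : "CloseDoc()" }
  ]
 }
}}"""


def handlers_from_json(text):
    """JSON parses straight into native dicts and lists."""
    data = json.loads(text)
    return [item["onclick"] for item in data["menu"]["popup"]["menuitem"]]


def handlers_from_xml(text):
    """XML parses into a tree that is navigated through a tree API."""
    root = ET.fromstring(text)
    return [item.get("onclick") for item in root.findall("./popup/menuitem")]


if __name__ == "__main__":
    print(handlers_from_json(MENU_JSON))
    print(handlers_from_xml(MENU_XML))
```

Both functions return the same list; the difference is that the JSON version works on ordinary language data structures, while the XML version goes through element and attribute accessors.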

Resource Description Framework (RDF): Semantic Web & Linked Data

(9) Linked Data Principles

Use URIs as names for things.
Use HTTP URIs so that people can look up those names.
When someone looks up a URI, provide useful information, using a standard data model.
Include links to other URIs, so that people can discover more things.

(10) URIs as Names

A URI can identify any abstract concept, not just Web pages
The Semantic Web is mainly about how to assign URIs to any concept
DBpedia [http://dbpedia.org/] is a project creating URIs for every topic in Wikipedia
- Spoon: http://dbpedia.org/resource/Spoon_%28band%29 [http://dbpedia.org/resource/Spoon_%28band%29]
- Mysticism: http://dbpedia.org/resource/Mysticism [http://dbpedia.org/resource/Mysticism]
Anyone can create URIs for concepts
- Panthera tigris: http://lod.taxonconcept.org/ses/QMUrD [http://lod.taxonconcept.org/ses/QMUrD]
Multiple people or groups can create URIs for the same concept
- All the URIs for Panthera tigris [http://sameas.org/html?uri=http%3A%2F%2Flod.geospecies.org%2Fses%2FQMUrD]

(11) RDF as Semantic Web Foundation

The standard data model for Semantic Web standards is the Resource Description Framework (RDF)
Data in RDF is just lists of statements
- Spoon is a music group
- Spoon is named Spoon
- Spoon has a member Britt Daniel
A statement consists of three parts: [subject] [predicate] [object]
- [Spoon] [is a] [music group]
- [Spoon] [is named] ["Spoon"]
- [Spoon] [has a member] [Britt Daniel]
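
The statement model can be sketched in a few lines of Python. This is only an illustration of the subject/predicate/object idea: the plain-text labels are taken from the list above, whereas real RDF would use URIs for subjects and predicates, and the match helper is a hypothetical name, not part of any RDF library.

```python
from typing import List, Optional, Tuple

# A statement is a (subject, predicate, object) triple.
Triple = Tuple[str, str, str]

# The three statements from the slide, using plain labels instead of URIs.
STATEMENTS: List[Triple] = [
    ("Spoon", "is a", "music group"),
    ("Spoon", "is named", "Spoon"),
    ("Spoon", "has a member", "Britt Daniel"),
]


def match(statements: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return all statements matching the given parts; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in statements
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


if __name__ == "__main__":
    # Who is a member of Spoon?
    for _, _, member in match(STATEMENTS, subject="Spoon", predicate="has a member"):
        print(member)
```

Because the data is just a list of statements, adding facts from another source means appending more triples; no schema has to change.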

(12) Semantic Web Data Model (abstract)

(13) Semantic Web Data Model (URI concepts)

(14) Linking Data Sets

Various XML Technologies

XML Information Set (XML Infoset)

(17) What is the Content of an XML Document?

An interesting (and fruitless) discussion
- the content is whatever you consider it to be
- agreement between peers is necessary for data exchange
- agreement between specification writers and toolmakers is necessary to provide tools
DOM and XSLT were two early arrivals
- both had an idea (and a model) of what the content of an XML document is
- they did not have the exact same idea
Set a normative standard for an XML document's content
- the Infoset defines what is represented in the tree
- people should be confident to get this information when using XML technologies

(18) Infoset Example

(19) What is Not in the Infoset

Do not rely on information not available in the Infoset [http://www.w3.org/TR/xml-infoset/#omitted]
- order of attributes
- type of quotes around attribute values
- notation of empty elements (<elem></elem> vs. <elem/>)
- how lines are terminated
- entities and character references
An XML document contains all this information when it is treated as serialized text
Many XML technologies are in fact Infoset technologies
- XSD, XSLT, XQuery, SOAP, …
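
A small Python sketch can make this tangible. It parses two documents that differ in attribute order, quote style, empty-element notation, and the use of a character reference, and compares what a parser reports. The content helper is a made-up name, and ElementTree is used only as a convenient stand-in; it is not a complete Infoset implementation.

```python
import xml.etree.ElementTree as ET

# Two serializations that differ only in details the Infoset does not record.
DOC_A = '<elem a="1" b="x&#65;"></elem>'
DOC_B = "<elem b='xA' a='1'/>"


def content(element):
    """Reduce an element to name, attributes (unordered), text, and children."""
    return (
        element.tag,
        dict(element.attrib),   # dict comparison ignores attribute order
        element.text or "",     # <elem></elem> and <elem/> both have no text
        [content(child) for child in element],
    )


# True when both documents yield the same parsed content.
same = content(ET.fromstring(DOC_A)) == content(ET.fromstring(DOC_B))

if __name__ == "__main__":
    print("same text:   ", DOC_A == DOC_B)
    print("same content:", same)
```

The strings differ, but the parsed content does not, which is why code that depends on such lexical details tends to break as soon as a document passes through another tool.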

XML IDs with xml:id

(21) References in Instances

DTD [Document Type Definition (DTD)] and XSD [XML Schema (XSD) – Part I] IDs are defined in the schema
- the important information is the (attribute) type in the schema
- the name of the attribute can be freely chosen by the schema designer
xml:id [http://www.w3.org/TR/xml-id/] establishes a well-known attribute name for IDs
- the IDness of an attribute is established by its name
- IDs can be found without any schema (there does not even have to be one)
xml:id uses XML's own namespace to identify identifiers
- only W3C-blessed specifications are allowed to extend the XML namespace
- http://www.w3.org/XML/1998/namespace [http://www.w3.org/XML/1998/namespace] contains a mix of various specs

<section id="introduction">

<section xml:id="introduction">
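
The following Python sketch shows how an element can be located by its xml:id without consulting any schema. The sample document and the find_by_xml_id helper are assumptions made for this example; the only fixed part is the XML namespace URI, which ElementTree exposes in its {namespace}name notation.

```python
import xml.etree.ElementTree as ET

# xml:id lives in the XML namespace; ElementTree spells it in Clark notation.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical document: two sections use xml:id, one uses a plain id attribute.
DOC = """<doc>
 <section xml:id="introduction"><title>Introduction</title></section>
 <section xml:id="details"><title>Details</title></section>
 <section id="appendix"><title>Appendix</title></section>
</doc>"""


def find_by_xml_id(root, value):
    """Return the first element whose xml:id equals value, or None."""
    for element in root.iter():
        if element.get(XML_ID) == value:
            return element
    return None


if __name__ == "__main__":
    root = ET.fromstring(DOC)
    print(find_by_xml_id(root, "introduction").find("title").text)
    # A plain id attribute is only an ID if some schema says so.
    print(find_by_xml_id(root, "appendix"))
```

The plain id attribute is not found, because without a DTD or XSD nothing marks it as an ID; the xml:id attribute is recognizable by name alone.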

XML Inclusions (XInclude)

(23) Including Trees into Trees

XInclude [http://www.w3.org/TR/xinclude/] defines an inclusion facility for XML
- it is a separate standard and not part of XML itself
- XML processors might or might not support XInclude
XInclude is defined as replacing XInclude elements with included content
- logically speaking, XInclude is thus defined as a tree transformation
- practically speaking, it can thus be implemented in XSLT [http://dret.net/projects/xipr/]

<x xmlns:xi="http://www.w3.org/2001/XInclude">
	<xi:include href="something.xml"/>
	<xi:include xpointer="xmlns(xi=http://www.w3.org/2001/XInclude)xpointer(x/xi:include[1])" parse="xml"/>
</x>
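
As a sketch of inclusion as a tree transformation, the Python example below expands an href-based include with the standard library module xml.etree.ElementInclude. The content of something.xml and the in-memory loader are assumptions made to keep the example self-contained. The xpointer-based include from the slide is left out, because to my understanding this module only handles href-based inclusion.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Host document with a single href-based include.
HOST = """<x xmlns:xi="http://www.w3.org/2001/XInclude">
 <xi:include href="something.xml"/>
</x>"""

# Hypothetical stand-in for the file system: href -> document text.
FILES = {"something.xml": "<something>included content</something>"}


def load_from_memory(href, parse, encoding=None):
    """Loader callback: return a parsed tree for parse="xml", else raw text."""
    if parse == "xml":
        return ET.fromstring(FILES[href])
    return FILES[href]


def expand(host_text):
    """Parse the host document and replace xi:include elements in place."""
    root = ET.fromstring(host_text)
    ElementInclude.include(root, loader=load_from_memory)
    return root


if __name__ == "__main__":
    print(ET.tostring(expand(HOST), encoding="unicode"))
```

After expansion, the xi:include element is gone and the root of the included document sits in its place, which is the tree replacement the definition describes.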

(24) XInclude Schema

<!ELEMENT xi:include (xi:fallback?)>
<!ATTLIST xi:include
    xmlns:xi         CDATA       #FIXED    "http://www.w3.org/2001/XInclude"
    href             CDATA       #IMPLIED
    parse            (xml|text)  "xml"
    xpointer         CDATA       #IMPLIED
    encoding         CDATA       #IMPLIED
    accept           CDATA       #IMPLIED
    accept-language  CDATA       #IMPLIED
>

Processing XML

(26) XML and Programming

XML is a format for structured data
- trees do not map very well to most programming languages
- for working with XML, some mapping into the language is required
There are two basic approaches for programming with XML:
1. use special functions to work on XML documents as external data objects
2. map XML documents to native data structures of the programming language
A third approach is to have an XML programming language
- XSLT [XML Transformations (XSLT) – Part I] is an example of an XML programming language
- XSD [XML Schema (XSD) – Part I] and XPath [XML Path Language (XPath)] become an integral part of Java in XJ [http://www.research.ibm.com/xj/]

(27) XML and Programming Languages

Most programming languages do not support XML natively
- a certain impedance mismatch between both models is unavoidable
Function libraries (or their equivalent) can provide XML processing facilities
- SAX [Simple API for XML (SAX) (1)] as an event-based API for accessing XML documents
- DOM [Document Object Model (DOM) (1)] as a tree-based API for accessing XML documents
Mapping between XML and the programming language can take two forms
- using hand-crafted code (based on XML functions) that performs the mapping
- generating code using an XML schema and/or target data structures in the language
Generating mapping code can be done in two ways
- using a generic XML Data Binding [XML Data Binding (1)] framework for the mapping
- using hand-crafted code that can be better tailored to the schemas

(28) Typical XML & Programming Problem

Asynchronous JavaScript and XML (Ajax) [../web-fall10/ajax] is based on HTTP & XML
- JavaScript code can communicate with the server using XMLHttpRequest [http://www.w3.org/TR/XMLHttpRequest/]
- in theory, the server sends XML data which is processed by the script
XML parsing and processing is inconvenient in JavaScript
- there is an impedance mismatch between JavaScript and XML
- if the client is slow and the XML is big, parsing can be time-consuming
- if all clients are JavaScript, then sending XML is not really necessary
JavaScript Object Notation (JSON) is a JavaScript-centric data model
- JavaScript code can directly instantiate JSON structures as runtime objects
- any non-JavaScript client (if there are any) will have to use JSON as well

Simple API for XML (SAX)

(30) Lightweight XML Processing

SAX is an event-based API for accessing XML documents
SAX allows users to use event handlers for parsing-related events
- the parser reads a document and recognizes markup structures
- for each recognized structure, a user-supplied function can be called
SAX parsing requires little memory and can handle very large documents
- the breadth of the XML document tree is irrelevant to SAX parsing
- the depth of the tree is relevant for checking for well-formed documents
SAX parsing does not allow random access or backward movement
- saving context and history is something the application has to manage
- at a certain complexity, SAX parsing requires a lot of additional code
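
A minimal Python sketch of event-based processing, using the standard library module xml.sax on the menu document from the JSON example. The MenuHandler class and the scan function are names invented for this illustration. Note that the handler has to keep its own stack of open elements, because the parser does not retain any context.

```python
import xml.sax

MENU_XML = b"""<?xml version="1.0"?>
<menu id="file" value="File">
 <popup>
  <menuitem value="New" onclick="CreateNewDoc()"/>
  <menuitem value="Open" onclick="OpenDoc()"/>
  <menuitem value="Close" onclick="CloseDoc()"/>
 </popup>
</menu>"""


class MenuHandler(xml.sax.ContentHandler):
    """Collects menu item labels and tracks nesting depth from SAX events."""

    def __init__(self):
        super().__init__()
        self.path = []       # stack of open elements, maintained by the application
        self.max_depth = 0
        self.labels = []

    def startElement(self, name, attrs):
        # Called by the parser for every start tag it recognizes.
        self.path.append(name)
        self.max_depth = max(self.max_depth, len(self.path))
        if name == "menuitem":
            self.labels.append(attrs.get("value"))

    def endElement(self, name):
        # Called for every end tag; only the stack of ancestors stays in memory.
        self.path.pop()


def scan(xml_bytes):
    """Run the parser over the document and return the filled handler."""
    handler = MenuHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler


if __name__ == "__main__":
    result = scan(MENU_XML)
    print(result.labels, result.max_depth)
```

The memory used by the handler grows with the depth of the document (the stack), not with the number of menu items, which mirrors the breadth versus depth remark above.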

(31) SAX Parser

Document Object Model (DOM)

(33) XML Trees Everywhere

DOM is a tree-based API for accessing XML documents
- the specification uses a language-independent Interface Definition Language (IDL) [http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/idl-definitions.html]
- language bindings map IDL to specific languages such as Java [http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/java-binding.html] or JavaScript [http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/ecma-script-binding.html]
DOM is based on an in-memory representation of an XML document
- random document access using the tree's node structure [http://www.w3.org/TR/DOM-Level-3-Core/core.html#ID-1950641247]
- more specific tasks such as getting an element's attribute by name [http://www.w3.org/TR/DOM-Level-3-Core/core.html#ID-217A91B8]
DOM parsers have an additional layer for building the tree
- an underlying SAX parser reports structures for tree building
- the memory representation is heavily interlinked (requiring substantial memory)
- DOM calls query or modify the memory representation of the tree
DOM processing is not appropriate for all tasks
- very large documents may not fit into memory (risk of thrashing)
- for isolated tasks, the parsing overhead is prohibitive
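
For comparison with event-based parsing, here is a small Python sketch using xml.dom.minidom, a lightweight DOM implementation in the standard library. It works on the menu document from the JSON example; the rename_item helper is a hypothetical name used only for this illustration.

```python
from xml.dom.minidom import parseString

MENU_XML = """<menu id="file" value="File">
 <popup>
  <menuitem value="New" onclick="CreateNewDoc()"/>
  <menuitem value="Open" onclick="OpenDoc()"/>
  <menuitem value="Close" onclick="CloseDoc()"/>
 </popup>
</menu>"""


def rename_item(document, old, new):
    """Change the label of the first menu item called old; report success."""
    for item in document.getElementsByTagName("menuitem"):
        if item.getAttribute("value") == old:
            item.setAttribute("value", new)   # modifies the in-memory tree
            return True
    return False


if __name__ == "__main__":
    dom = parseString(MENU_XML)               # the whole tree is built in memory
    items = dom.getElementsByTagName("menuitem")
    # Random access: jump to the last item, then walk back up to its parent.
    print(items[2].getAttribute("value"), items[2].parentNode.tagName)
    rename_item(dom, "Close", "Quit")
    print(dom.documentElement.toxml())
```

Random access, upward navigation, and in-place modification are what the in-memory tree buys; the price is that the complete document has to be parsed and held in memory first.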

(34) DOM Parser

(35) JDOM

DOM is not optimized for a specific programming language
- DOM knowledge can be easily transferred between programming languages
- programming with DOM in a given language often is not very convenient
JDOM is a Java-specific version of a tree-based XML API
- represents the same concepts as DOM (XML structures)
- represents XML concepts in a more Java-friendly way [http://www-128.ibm.com/developerworks/java/library/j-jdom/#h2]
- JDOM has no relationship with the W3C's DOM API
JDOM can be built on top of almost any parser
- SAX is a pretty common choice for a foundation for JDOM
- SAX events are then used to build the JDOM tree

XML Data Binding

(37) Mapping XML into Languages

XML data binding connects XML with language-specific structures
- for OO languages this often means mapping schemas and classes
- code for serialization and deserialization can then be generated
Typical problems of data binding are schema changes
- if the schema is updated, can the code be migrated easily?
- can instances of different versions be handled by the same code?
- most data binding frameworks do not fully support XSD anyway
Several XML data binding frameworks are in widespread use
- Java Architecture for XML Binding (JAXB) [https://jaxb.dev.java.net/]
- Castor, another Java-based data binding framework
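
To illustrate what a binding does, the Python sketch below maps the menu document from the JSON example to classes and back. This is hand-written mapping code, not the output of JAXB, Castor, or any other framework, and the class and function names are assumptions made for the example; a framework would typically generate comparable code from a schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List

MENU_XML = """<menu id="file" value="File">
 <popup>
  <menuitem value="New" onclick="CreateNewDoc()"/>
  <menuitem value="Open" onclick="OpenDoc()"/>
  <menuitem value="Close" onclick="CloseDoc()"/>
 </popup>
</menu>"""


@dataclass
class MenuItem:
    value: str
    onclick: str


@dataclass
class Menu:
    id: str
    value: str
    items: List[MenuItem] = field(default_factory=list)


def menu_from_xml(text: str) -> Menu:
    """Deserialize: turn the XML tree into language-specific objects."""
    root = ET.fromstring(text)
    return Menu(
        id=root.get("id"),
        value=root.get("value"),
        items=[MenuItem(i.get("value"), i.get("onclick"))
               for i in root.findall("popup/menuitem")],
    )


def menu_to_xml(menu: Menu) -> str:
    """Serialize: rebuild the XML structure from the objects."""
    root = ET.Element("menu", {"id": menu.id, "value": menu.value})
    popup = ET.SubElement(root, "popup")
    for item in menu.items:
        ET.SubElement(popup, "menuitem",
                      {"value": item.value, "onclick": item.onclick})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    menu = menu_from_xml(MENU_XML)
    menu.items.append(MenuItem("Save", "SaveDoc()"))
    print(menu_to_xml(menu))
```

The sketch also hints at the versioning problem: if the schema gains an element or attribute, both mapping functions and the classes have to change together.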
