XML Varia

XML Foundations [./]
Fall 2013 — INFO 242 (CCN 41613)

Erik Wilde, UC Berkeley School of Information
2013-12-09

This work is licensed under a CC Attribution 3.0 Unported License [http://creativecommons.org/licenses/by/3.0/]

(2) Abstract

The first half of the lecture compares XML to alternatives that are also used to represent, manage, exchange, or process data. The most relevant approaches in this space are RDF, JSON, and tabular/relational models such as SQL or NoSQL. One reason to use XML-based approaches for data representation, management, interchange, and processing is the large landscape of existing standards, technologies, and tools, which often makes it possible to solve a problem by reusing existing solutions. The second half of the lecture looks at a small set of additional standards that have not yet been covered.



Alternatives to XML

Outline (Alternatives to XML)

  1. Alternatives to XML
    1. JavaScript Object Notation (JSON)
    2. Resource Description Framework (RDF)
  2. Various XML Technologies
    1. XML Information Set (XML Infoset)
    2. XML IDs with xml:id
    3. XML Inclusions (XInclude)
    4. Processing XML
      1. Simple API for XML (SAX)
      2. Document Object Model (DOM)
      3. XML Data Binding

(4) XML's Capabilities

If the only tool you have is a hammer, you tend to see every problem as a nail.

Abraham Maslow [http://en.wikipedia.org/wiki/Abraham_Maslow]



JavaScript Object Notation (JSON)

(6) JSON is the new XML

  • XML's tree model is not natural for developers
    • developers tend to think in objects or similar concepts
    • mapping bidirectionally creates unhappiness and friction
  • JSON takes a minimalist approach to representing object structures
    • the only structured types are objects (name/value pairs) [http://tools.ietf.org/html/rfc4627#section-2.2] and arrays [http://tools.ietf.org/html/rfc4627#section-2.3]; values are strings, numbers, booleans, or null
    • these concepts can be nested as deeply as required
  • JSON's main reason for success is structural alignment
    • most developers can work with JSON directly in their language
    • XML's more sophisticated features often are not required
  • JSON is a poster child of the Pareto Principle [http://en.wikipedia.org/wiki/Pareto_principle] (80/20, or maybe 95/5)


(7) JSON Example

<?xml version="1.0"?>
<menu id="file" value="File">
 <popup>
  <menuitem value="New" onclick="CreateNewDoc()"/>
  <menuitem value="Open" onclick="OpenDoc()"/>
  <menuitem value="Close" onclick="CloseDoc()"/>
 </popup>
</menu>

{ "menu" : {
 "id" : "file",
 "value" : "File",
 "popup" : {
  "menuitem" : [
   { "value" : "New", "onclick" : "CreateNewDoc()" },
   { "value" : "Open", "onclick" : "OpenDoc()" },
   { "value" : "Close", "onclick" : "CloseDoc()" }
  ]
 }
}}
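
As a rough illustration of the structural alignment argument, the following Python sketch reads the same menu from both representations. The JSON version becomes nested dictionaries and lists that can be indexed directly, while the XML version is navigated through a tree API. The variable names are illustrative only and not part of the original example.

```python
import json
import xml.etree.ElementTree as ET

# The two representations from the slide (XML declaration omitted).
menu_json = """
{ "menu" : {
 "id" : "file",
 "value" : "File",
 "popup" : {
  "menuitem" : [
   { "value" : "New", "onclick" : "CreateNewDoc()" },
   { "value" : "Open", "onclick" : "OpenDoc()" },
   { "value" : "Close", "onclick" : "CloseDoc()" }
  ]
 }
}}
"""

menu_xml = """<menu id="file" value="File">
 <popup>
  <menuitem value="New" onclick="CreateNewDoc()"/>
  <menuitem value="Open" onclick="OpenDoc()"/>
  <menuitem value="Close" onclick="CloseDoc()"/>
 </popup>
</menu>"""

# JSON: parsing yields plain dicts and lists of the host language.
data = json.loads(menu_json)
json_values = [item["value"] for item in data["menu"]["popup"]["menuitem"]]

# XML: parsing yields a tree that is queried through an XML-specific API.
root = ET.fromstring(menu_xml)
xml_values = [item.get("value") for item in root.findall("./popup/menuitem")]

print(json_values)
print(xml_values)
```

Both lists contain the same menu labels; the difference lies in how much XML-specific API the code has to know.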


Resource Description Framework (RDF): Semantic Web & Linked Data

(9) Linked Data Principles

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using a standard data model.
  4. Include links to other URIs, so that people can discover more things.


(10) URIs as Names



(11) RDF as Semantic Web Foundation

  • The standard data model for Semantic Web standards is the Resource Description Framework (RDF)
  • Data in RDF is just lists of statements
    • Spoon is a music group
    • Spoon is named Spoon
    • Spoon has a member Britt Daniel
  • A statement consists of three parts: [subject] [predicate] [object]
    • [Spoon] [is a] [music group]
    • [Spoon] [is named] ["Spoon"]
    • [Spoon] [has a member] [Britt Daniel]
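
The statement model above can be sketched in a few lines of Python. This is only an illustration of the subject/predicate/object structure, not an RDF implementation: the `http://example.org/` namespace and the predicate names are hypothetical placeholders, and a real application would use an RDF library and established vocabularies.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical namespace used only for this illustration.
EX = "http://example.org/"

statements = [
    Triple(EX + "Spoon", EX + "isA", EX + "MusicGroup"),
    Triple(EX + "Spoon", EX + "isNamed", '"Spoon"'),        # literal value
    Triple(EX + "Spoon", EX + "hasMember", EX + "BrittDaniel"),
]


def objects(subject: str, predicate: str) -> list:
    """Return the objects of all statements matching subject and predicate."""
    return [t.obj for t in statements
            if t.subject == subject and t.predicate == predicate]


members = objects(EX + "Spoon", EX + "hasMember")
print(members)
```

Because every statement has the same shape, data sets can be merged by concatenating their statement lists.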


(12) Semantic Web Data Model (abstract)

rdf-triple.png

(13) Semantic Web Data Model (URI concepts)

rdf-triple-uris.png

(14) Linking Data Sets

lod-cloud.png

Various XML Technologies


XML Information Set (XML Infoset)

(17) What is the Content of an XML Document?

  • An interesting (and fruitless) discussion
    • the content is whatever you consider it to be
    • agreement between peers is necessary for data exchange
    • agreement between specification writers and toolmakers is necessary to provide tools
  • DOM and XSLT were two early arrivals
    • both had an idea (and a model) of what the content of an XML document is
    • they did not have the exact same idea
  • The Infoset sets a normative standard for an XML document's content
    • the Infoset defines what is represented in the tree
    • users can rely on getting this information when using XML technologies


(18) Infoset Example

infoset-example.png

(19) What is Not in the Infoset

  • Do not rely on information not available in the Infoset [http://www.w3.org/TR/xml-infoset/#omitted]
    • order of attributes
    • type of quotes around attribute values
    • notation of empty elements (<elem></elem> vs. <elem/>)
    • how lines are terminated
    • entities and character references
  • The serialized XML document contains all of this information, but only when it is processed as text
  • Many XML technologies are in fact Infoset technologies
    • XSD, XSLT, XQuery, SOAP, …
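
A small Python sketch can make this concrete. The two documents below differ in attribute order, quote style, empty-element notation, and the use of a character reference, yet a typical parser reports the same tree for both. The element names and the `summarize` helper are made up for this illustration, and ElementTree's data model is used here only as a stand-in for the Infoset.

```python
import xml.etree.ElementTree as ET

# Same information, different surface syntax.
doc_a = '<item id="1" lang=\'en\'><name>A&amp;B</name><empty></empty></item>'
doc_b = "<item lang=\"en\" id='1'><name>A&#38;B</name><empty/></item>"


def summarize(elem):
    """Reduce an element to the information the parser exposes."""
    return (
        elem.tag,
        dict(elem.attrib),                 # dict comparison ignores order
        elem.text,
        [summarize(child) for child in elem],
    )


tree_a = summarize(ET.fromstring(doc_a))
tree_b = summarize(ET.fromstring(doc_b))

print(doc_a == doc_b)    # the serialized documents differ
print(tree_a == tree_b)  # the parsed content does not
```

Code that depends on any of these surface differences cannot rely on them surviving a round trip through Infoset-based tools.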


XML IDs with xml:id

(21) References in Instances

  • DTD [Document Type Definition (DTD)] and XSD [XML Schema (XSD) – Part I] IDs are defined in the schema
    • the important information is the (attribute) type in the schema
    • the name of the attribute can be freely chosen by the schema designer
  • xml:id [http://www.w3.org/TR/xml-id/] establishes a well-known attribute name for IDs
    • the IDness of an attribute is established by its name
    • IDs can be found without any schema (there does not even have to be one)
  • xml:id uses XML's own namespace to identify identifiers
<section id="introduction">
<section xml:id="introduction">
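
Because the attribute name alone establishes IDness, a lookup needs no schema. The following Python sketch assumes a hypothetical `article` document and a hypothetical helper `find_by_xml_id`; it relies only on the fact that the `xml` prefix is bound to the fixed namespace `http://www.w3.org/XML/1998/namespace`.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the XML namespace in {uri}local notation.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

doc = """<article>
 <section xml:id="introduction"><title>Introduction</title></section>
 <section xml:id="summary"><title>Summary</title></section>
</article>"""


def find_by_xml_id(root, value):
    """Return the first element whose xml:id equals value, or None."""
    for elem in root.iter():
        if elem.get(XML_ID) == value:
            return elem
    return None


root = ET.fromstring(doc)
section = find_by_xml_id(root, "summary")
print(section.find("title").text)
```

With a plain `id` attribute, the same lookup would only be justified if a DTD or schema declared that attribute to be of type ID.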


XML Inclusions (XInclude)

(23) Including Trees into Trees

  • XInclude [http://www.w3.org/TR/xinclude/] defines an inclusion facility for XML
    • it is a separate standard and not part of XML itself
    • XML processors might or might not support XInclude
  • XInclude is defined as replacing XInclude elements with included content
    • logically speaking, XInclude is thus defined as a tree transformation
    • practically speaking, it can thus be implemented in XSLT [http://dret.net/projects/xipr/]
<x xmlns:xi="http://www.w3.org/2001/XInclude">
	<xi:include href="something.xml"/>
	<xi:include xpointer="xmlns(xi=http://www.w3.org/2001/XInclude)xpointer(x/xi:include[1])" parse="xml"/>
</x>
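
The replacement of XInclude elements by included content can be tried with Python's `xml.etree.ElementInclude` module. This is a minimal sketch under several assumptions: the included document is held in an in-memory dictionary instead of a file, its content is invented for the example, and only the simple `href` form is shown (the `xpointer` form above is not attempted).

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical stand-in for documents that would normally live on disk.
DOCS = {"something.xml": "<included>hello</included>"}


def loader(href, parse, encoding=None):
    """Resolve an href from the in-memory table instead of the file system."""
    if parse == "xml":
        return ET.fromstring(DOCS[href])
    return DOCS[href]  # parse="text": return the raw string


root = ET.fromstring(
    '<x xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="something.xml"/>'
    '</x>'
)

# Replaces each xi:include element in place with the loaded tree.
ElementInclude.include(root, loader=loader)

print(ET.tostring(root, encoding="unicode"))
```

After processing, the `xi:include` element is gone and the tree contains the included element in its place, which is the tree transformation described above.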


(24) XInclude Schema

<!ELEMENT xi:include (xi:fallback?)>
<!ATTLIST xi:include
    xmlns:xi         CDATA       #FIXED    "http://www.w3.org/2001/XInclude"
    href             CDATA       #IMPLIED
    parse            (xml|text)  "xml"
    xpointer         CDATA       #IMPLIED
    encoding         CDATA       #IMPLIED
    accept           CDATA       #IMPLIED
    accept-language  CDATA       #IMPLIED
>


Processing XML

(26) XML and Programming

  • XML is a format for structured data
    • trees do not map very well to most programming languages
    • for working with XML, some mapping into the language is required
  • There are two basic approaches for programming with XML:
    1. use special functions to work on XML documents as external data objects
    2. map XML documents to native data structures of the programming language
  • A third approach is to have an XML programming language
    • XSLT [XML Transformations (XSLT) – Part I] is an example of an XML programming language
    • XSD [XML Schema (XSD) – Part I] and XPath [XML Path Language (XPath)] become an integral part of Java in XJ [http://www.research.ibm.com/xj/]


(27) XML and Programming Languages

  • Most programming languages do not support XML natively
    • a certain impedance mismatch between both models is unavoidable
  • Function libraries (or their equivalent) can provide XML processing facilities
    • SAX [Simple API for XML (SAX) (1)] as an event-based API for accessing XML documents
    • DOM [Document Object Model (DOM) (1)] as a tree-based API for accessing XML documents
  • Mapping between XML and the programming language can take two forms
    • using hand-crafted code (based on XML functions) that performs the mapping
    • generating code using an XML schema and/or target data structures in the language
  • Generating mapping code can be done in two ways
    • using a generic XML Data Binding [XML Data Binding (1)] framework for the mapping
    • using hand-crafted code that can be better tailored to the schemas


(28) Typical XML & Programming Problem

  • Asynchronous JavaScript and XML (Ajax) [../web-fall10/ajax] is based on HTTP & XML
    • JavaScript code can communicate with the server using XMLHttpRequest [http://www.w3.org/TR/XMLHttpRequest/]
    • in theory, the server sends XML data which is processed by the script
  • XML parsing and processing is inconvenient in JavaScript
    • there is an impedance mismatch between JavaScript and XML
    • if the client is slow and the XML is big, parsing can be time-consuming
    • if all clients are JavaScript, then sending XML is not really necessary
  • JavaScript Object Notation (JSON) is a JavaScript-centric data model
    • JavaScript code can directly instantiate JSON structures as runtime objects
    • any non-JavaScript client (if there are any) will have to use JSON as well


Simple API for XML (SAX)

(30) Lightweight XML Processing

  • SAX is an event-based API for accessing XML documents
  • SAX allows users to use event handlers for parsing-related events
    • the parser reads a document and recognizes markup structures
    • for each recognized structure, a user-supplied function can be called
  • SAX parsing requires little memory and can handle very large documents
    • the breadth of the XML document tree is irrelevant to SAX parsing
    • the depth of the tree is relevant, because the parser has to track open elements to check well-formedness
  • SAX parsing does not allow random access or backward movement
    • saving context and history is something the application has to manage
    • at a certain complexity, SAX parsing requires a lot of additional code
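
A minimal sketch of event-based processing with Python's `xml.sax` module follows. The handler class name and its fields are made up for this illustration. Note that the handler has to keep its own state (current depth, collected values), since the parser does not retain anything once an event has been reported.

```python
import xml.sax

menu_xml = b"""<menu id="file" value="File">
 <popup>
  <menuitem value="New" onclick="CreateNewDoc()"/>
  <menuitem value="Open" onclick="OpenDoc()"/>
  <menuitem value="Close" onclick="CloseDoc()"/>
 </popup>
</menu>"""


class MenuItemHandler(xml.sax.ContentHandler):
    """Collects menuitem labels and tracks nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # context the application must manage itself
        self.max_depth = 0
        self.values = []

    def startElement(self, name, attrs):
        # Called by the parser for every start tag it recognizes.
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)
        if name == "menuitem":
            self.values.append(attrs.get("value"))

    def endElement(self, name):
        # Called for every end tag (also for empty elements).
        self.depth -= 1


handler = MenuItemHandler()
xml.sax.parseString(menu_xml, handler)

print(handler.values)
print(handler.max_depth)
```

The handler never sees the document as a whole, which is why memory use stays small but anything resembling random access has to be built by hand.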


(31) SAX Parser

SAX Parser

Document Object Model (DOM)

(33) XML Trees Everywhere

  • DOM is a tree-based API for accessing XML documents
    • the specification uses a language-independent Interface Definition Language (IDL) [http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/idl-definitions.html]
    • language bindings map IDL to specific languages such as Java [http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/java-binding.html] or JavaScript [http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/ecma-script-binding.html]
  • DOM is based on an in-memory representation of an XML document
  • DOM parsers have an additional layer for building the tree
    • an underlying SAX parser reports structures for tree building
    • the memory representation is heavily interlinked (requiring substantial memory)
    • DOM calls query or modify the memory representation of the tree
  • DOM processing is not appropriate for all tasks
    • very large documents may not fit into memory (risk of thrashing)
    • for isolated tasks, the overhead of parsing and building the whole tree can be prohibitive
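
For comparison, here is a minimal sketch of tree-based processing using `xml.dom.minidom`, a lightweight DOM implementation in Python's standard library. The whole document is parsed into memory first; DOM calls then query and modify that tree. The added "Save" item and its `SaveDoc()` handler are hypothetical.

```python
from xml.dom import minidom

menu_xml = """<menu id="file" value="File">
 <popup>
  <menuitem value="New" onclick="CreateNewDoc()"/>
  <menuitem value="Open" onclick="OpenDoc()"/>
  <menuitem value="Close" onclick="CloseDoc()"/>
 </popup>
</menu>"""

# Parsing builds the complete in-memory tree.
doc = minidom.parseString(menu_xml)

# Query: random access to any part of the tree.
values_before = [item.getAttribute("value")
                 for item in doc.getElementsByTagName("menuitem")]

# Modify: create a new element and attach it to the popup element.
new_item = doc.createElement("menuitem")
new_item.setAttribute("value", "Save")
new_item.setAttribute("onclick", "SaveDoc()")  # hypothetical handler name
popup = doc.getElementsByTagName("popup")[0]
popup.appendChild(new_item)

values_after = [item.getAttribute("value")
                for item in doc.getElementsByTagName("menuitem")]

print(values_before)
print(values_after)
```

The method names (`getElementsByTagName`, `createElement`, `appendChild`) come from the DOM specification, so the same code carries over to other language bindings with little change.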


(34) DOM Parser

DOM Parser

(35) JDOM

  • DOM is not optimized for a specific programming language
    • DOM knowledge can be easily transferred between programming languages
    • programming with DOM in a given language often is not very convenient
  • JDOM is a Java-specific version of a tree-based XML API
    • represents the same concepts as DOM (XML structures)
    • represents XML concepts in a more Java-friendly way [http://www-128.ibm.com/developerworks/java/library/j-jdom/#h2]
    • JDOM has no relationship with the W3C's DOM API
  • JDOM can be built on top of almost any parser
    • SAX is a pretty common choice for a foundation for JDOM
    • SAX events are then used to build the JDOM tree


XML Data Binding

(37) Mapping XML into Languages

  • XML data binding connects XML with language-specific structures
    • for OO languages this often means mapping schemas and classes
    • code for serialization and deserialization can then be generated
  • Schema changes are a typical problem for data binding
    • if the schema is updated, can the code be migrated easily?
    • can instances of different versions be handled by the same code?
    • most data binding frameworks do not fully support XSD anyway
  • Several XML data binding frameworks are in widespread use
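
The idea of mapping between XML and language-level classes can be sketched by hand. The following Python example is not a data binding framework: the `MenuItem` class and its two methods are hand-written and hypothetical, standing in for the serialization and deserialization code that a framework would typically generate from a schema.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class MenuItem:
    """Language-level counterpart of a <menuitem> element."""
    value: str
    onclick: str

    @classmethod
    def from_element(cls, elem):
        """Deserialize: XML element to object."""
        return cls(value=elem.get("value"), onclick=elem.get("onclick"))

    def to_element(self):
        """Serialize: object to XML element."""
        return ET.Element(
            "menuitem", {"value": self.value, "onclick": self.onclick}
        )


elem = ET.fromstring('<menuitem value="New" onclick="CreateNewDoc()"/>')
item = MenuItem.from_element(elem)

# Round trip: object to XML and back should yield an equal object.
round_trip = MenuItem.from_element(item.to_element())

print(item)
print(round_trip == item)
```

If the schema later adds or renames an attribute, both the class and the two mapping methods have to change, which is the migration problem noted above.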

