XML Foundations (INFOSYS 242)
Erik Wilde, UC Berkeley iSchool, Fall Semester 2006

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License (http://creativecommons.org/licenses/by-nc-sa/2.5/).

Course: XML Foundations, SIMS INFOSYS 242, Fall 2006 (2 units)
Web page: http://dret.net/lectures/xml-fall06/
Dates: 2006-08-29 to 2006-10-19
Professor: Erik Wilde, dret@sims.berkeley.edu, office phone +1-510-6432253, http://dret.net/netdret/
Office hours: Tu 15:30-16:30 and Th 15:30-16:30, 314 South Hall
TA: Katrina Rhoads Lindholm, krhoads@sims.berkeley.edu, http://ischool.berkeley.edu/~krhoads/
TA office hours: M 12:30-14:00, 210 South Hall
Grading: PNP
Textbook (required): Erik T. Ray, Learning XML, 2nd Edition, O'Reilly, September 2003, ISBN 0-596-00420-6

Three hours of lecture and one hour of laboratory per week. The Extensible Markup Language (XML), with its ability to define formal structural and semantic definitions for metadata and information models, is the key enabling technology for information services and document-centric business models that use the Internet and its family of protocols. This course introduces XML syntax, styles and transformations, and schema languages. It balances conceptual topics with practical skills for designing and implementing conceptual models as XML schemas.

Lecture: Tu 14:00-15:30 and Th 14:00-15:30, 110 South Hall

Required

Lecture Notes

Assignment 1: Getting Started with XML and XML Editors
Ungraded, 2006-09-05 to 2006-09-12
This assignment introduces you to XML in the context of XML Spy or the oXygen XML Editor. You don't have to turn anything in. However, you should use this opportunity to get comfortable with one of the two editors, as you will be using them for the rest of the semester, and maybe the rest of your life.
http://dret.net/lectures/xml-fall06/a/1
Sample Files: http://www.dret.net/lectures/xml-fall06/a/1/a1.zip

Assignment 2: Résumé XML and DTD
Ungraded, 2006-09-07 to 2006-09-14
In this assignment, you will take a sample résumé and create an XML representation of it. You will also create a DTD that can be used to validate your XML document, and other résumés that are structurally similar to yours.
http://dret.net/lectures/xml-fall06/a/2
Tor Landheim's XML
Tor Landheim's DTD
Sample Résumé: http://www.dret.net/lectures/xml-fall06/a/2/SampleResume.pdf

Assignment 3: CSS
Ungraded, 2006-09-14 to 2006-09-19
Create a Cascading Style Sheet (CSS) for a simple HTML document. The HTML contains simple structural markup and some additional classes which should be used for creating formatting specific to these contents.
http://dret.net/lectures/xml-fall06/a/3
Directory of Submitted CSS's
CSS Slideshow
HTML Document: http://www.dret.net/lectures/xml-fall06/a/3/SampleHTML.html
CSS Zen Garden: http://www.csszengarden.com/
WDG HTML Reference: http://www.htmlhelp.org/reference/html40/
w3schools.com CSS Reference: http://www.w3schools.com/css/

Assignment 4: XPath and Namespaces
Ungraded, 2006-09-21 to 2006-09-26
Answer a set of questions about XPath and Namespaces.
http://dret.net/lectures/xml-fall06/a/4
Solutions
XML Document to use for XPath Evaluations: http://www.dret.net/lectures/xml-fall06/a/4/dret.xml
Pretty (and Complete) HTML Version of the above XML Document: http://dret.net/biblio/
w3schools.com XPath Reference: http://www.w3schools.com/xpath/

Assignment 5: XML to HTML Transformation
Ungraded, 2006-09-26 to 2006-10-03
Create an XML file with your personal résumé information and then transform it into HTML using XSLT.
http://dret.net/lectures/xml-fall06/a/5
w3schools.com XPath Reference: http://www.w3schools.com/xpath/
w3schools.com XSLT Reference: http://www.w3schools.com/xsl

Assignment 6: DTD to Schema
Ungraded, 2006-10-05 to 2006-10-12
Convert your résumé DTD to an XML Schema. In addition to the simple way of moving from DTD syntax to XML Schema syntax, we also require you to improve the schema, so that it is a better schema than the DTD (because it is more selective in what it validates).
http://dret.net/lectures/xml-fall06/a/6
w3schools.com XML Schema Reference: http://www.w3schools.com/schema/

Assignment 7: XML to XML Transformation with CSS on generated HTML
Graded, 2006-10-17 to 2006-10-26
In this final assignment, you will have the opportunity to utilize many of the XML skills you have learned throughout the course. These include XML, XML Schema, XPath, XSLT and CSS.
http://dret.net/lectures/xml-fall06/a/7
Resume Schema: http://www.dret.net/lectures/xml-fall06/a/7/resume.xsd
XSL for converting to HTML: http://www.dret.net/lectures/xml-fall06/a/7/resume.xsl
w3schools.com XML Schema Reference: http://www.w3schools.com/schema/
w3schools.com XPath Reference: http://www.w3schools.com/xpath/
w3schools.com XSLT Reference: http://www.w3schools.com/xsl

2006-07-18 dret

Overview and Introduction
Tuesday, August 29, 2006
XML 1.0 Press Release

Abstract: The Extensible Markup Language (XML) was introduced in 1998 to enable content providers to publish their content on the Web in an application-specific format. HTML was considered to convey too little semantics, since its only purpose was (and is) the preparation of content for Web-based publishing. XML was the first step towards machine-readable data formats for the Web, a trend that has since been taken to higher levels with the idea of the Semantic Web. XML appeared when the Web was in the steepest part of its success curve, and has since become the globally accepted format for the exchange of machine-readable structured data.

Varia

About Me

Apprenticeship at Hahn-Meitner-Institut Berlin (HMI) (85-88)
Computer Science at Technical University of Berlin (TUB) (88-91)

working on DAPHNE, an SGML-based document preparation system

Ph.D. at ETH Zürich (92-97)

thesis on Group and Session Management for Collaborative Applications

Post-Doc at ICSI, Berkeley (97/98)

book on Technical Foundations of the World Wide Web

Various activities back in Switzerland (98-06)

teaching at ETH Zürich and FHNW
working as independent consultant (training, courses, consulting)
research in various XML-related areas
starting and leading the ShaRef project

About You

About this Course

Course Web page: http://dret.net/lectures/xml-fall06/
Course mailing list: subscribe at majordomo@sims.berkeley.edu

no subject (leave blank)
body of message: subscribe i242

Grading is offered pass/fail only
Lab times have to be negotiated today or Thursday

Tuesday 11:00-12:30
Wednesday 12:30-14:00
Wednesday 16:00-17:30
Thursday 11:00-12:30

About these Slides

Generated from XSLidy XML

all Slidy presentations are generated from this source
242.xml for importing the syllabus into SylViA
toc.html for displaying the summary on the course's Web page

Designed for online presentation and use (lots of links!)

for printing, press a (all slides) and then s (smaller font) a couple of times

A good real-world example for XML applications

XSLidy is useful, but there is no interface (XML editing only)
SylViA is useful, but there is no interface (XML editing or XSLidy export)
SylViA is over-modeled in some areas and too monolithic
UCB-wide management of course material and syllabi would be great

Additional Resources

My Online Glossary at http://dret.net/glossary/

suggestions, updates, corrections are very welcome
another exercise in how to use XML and XSLT for information management

My bibliography at http://dret.net/biblio/

suggestions, updates, corrections are very welcome
produced by an XML-centric system for managing bibliography data

The World Wide Web Consortium (W3C)

the organization which invented XML
as well as (almost) all other technologies covered in this course

Why XML?

Web Technologies

Early Web: URI+HTTP+HTML

URIs identify resources (in a human-readable way)
HTTP retrieves resources (using a simple protocol)
HTML is the resource format (using a simple data format)

The early Web was a distributed hypermedia system

not designed by hypermedia researchers or companies
simple enough to be adopted very fast

The Web today uses many different technologies

URI+HTTP+HTML for basic Web publishing
CSS & JavaScript (maybe even AJAX) for advanced publishing

JavaScript & XML (a.k.a. AJAX)

scripts dynamically loading data from a server
machine-to-machine interaction: the server and the script

From Humans to Machines

The Web was designed for humans

HTML is a language for describing page layout and links
machines were only used for implementing it

Search engines were the first machine users on the Web

they made the Web's success possible
they demonstrated how hard it is to understand HTML pages
search engines are still a very active field of research

A bigger Web needs more automation

Pre-XML Problems

HTML is for Humans

HTML is a format for dead ends

HTML is good for rendering Web pages
HTML is bad for understanding Web pages
the browser is a dead end (from a machine's point of view)

Web growth in the late 1990s was enormous

everybody was putting information online
but this information was inaccessible for machines

How can this information be made accessible to machines?

HTML is not the right format (slightly better than fax machines)
there was no other widely accepted format for structured data

A Machine-Friendly Web

Information should be published in a machine-understandable format

HTML is good for rendering Web pages
HTML is bad for understanding Web pages
understanding is the key term here: application semantics!

Information should be published in application-specific formats

HTML is one application: Rendering documents for humans
machines need other structures to process Web content

1996: W3C Working Group SGML on the Web

HTML is just one document type defined with SGML
SGML is a very complex and expensive technology
how can SGML be made easily and widely usable?

XML on the Web

SGML, HTML, and XML

Standard Generalized Markup Language (SGML)

a language for designing document types
a very complex standard with many expensive and non-interoperable implementations

Hypertext Markup Language (HTML)

implements a simple SGML document type
its syntax is SGML syntax; it is not defined by HTML itself
uses very few SGML features, so dedicated processors are rather easy to build

Extensible Markup Language (XML)

a language for designing document types (i.e., classes of documents)
a greatly simplified version of SGML, omitting many obscure features
a specification with no optional parts!

XML Documents on the Web

XML's idea was that content should be published as XML

stylesheets could then be used to render human-readable views
machines could simply use the underlying XML

There are (almost) no XML documents on the Web

stylesheet support depends on browsers (software has a long life!)
many content providers do not want to publish machine-readable data

There are many XML documents behind HTML documents

content does not have to be made public in a machine-readable way
browser-independent HTML can be produced from XML
XML technologies can be leveraged on the server-side

XML Documents Elsewhere

XML is not used as intended, but it is very successful

as a server-side foundation for Web publishing
as a B2B-focused format with no Web publishing in mind

XML has been successful for several reasons

being there at the right time (Internet bubble)
politically correct (the W3C is OS-agnostic)
technically sound (simple and no optional parts)
human-readable based on a well-known syntax
great for rapid prototyping and experiments

XML Today

Used Everywhere

Very small: Messages from sensors

e.g., building automation or car electronics
mostly implemented in hardware or firmware

Very large: Genome sequences

encoding the results of genome analyses
yields very large XML documents (several gigabytes)

Very different processing requirements

very fast processing (time critical applications)
memory-conserving processing (very large documents)
incremental processing (streaming)
random access (only small part required)
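
The memory-conserving and incremental requirements can be illustrated with a small sketch. It uses Python's standard `xml.etree.ElementTree.iterparse`; the sensor vocabulary (`readings`, `reading`, `sensor`, `value`) and the function name are invented for illustration and are not part of the course material.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sensor log; a real document could be gigabytes in size.
SENSOR_LOG = b"""<readings>
  <reading sensor="t1" value="21.5"/>
  <reading sensor="t2" value="19.0"/>
  <reading sensor="t1" value="22.5"/>
</readings>"""


def average_value(stream, sensor):
    """Average the values of one sensor while reading the stream incrementally."""
    total, count = 0.0, 0
    # iterparse hands over each element as soon as its end tag has been read.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "reading":
            if elem.get("sensor") == sensor:
                total += float(elem.get("value"))
                count += 1
            elem.clear()  # discard attributes and content that are no longer needed
    return total / count if count else None


print(average_value(io.BytesIO(SENSOR_LOG), "t1"))  # 22.0
```

Clearing each element after use keeps the memory footprint small, which is the point of incremental processing; random access, by contrast, usually needs the complete tree or an index.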

This Course and XML

XML is the ASCII for the 21st century

information professionals should know and use XML
you will see it in many projects
you will hopefully use it in many projects
you will be able to build and test prototypes very rapidly

What do you need for using XML?

XML and some kind of schema language
XSLT for processing it

What is XML?

XML Yin & Yang

XML is:

great for exchanging trees (if this is what you want to do)
platform-independent (even your mobile phone processes XML)
a foundation for other technologies (some of which we will look at)

XML is not:

a programming language (ever programmed comma-separated values?)
capturing semantics (without higher-layer consensus, XML is worthless)
ensuring interoperability (we both use bits! we can interoperate!)

What is XML Good for?

Why Use XML?

Because you want to share data

share it in a format which is widely used and easy to use
enable others to use it on various platforms with existing tools

Because you want to share data cheaply

it is easier to use XML than to invent something new
it is even easier to use an existing XML schema than to invent a new one

Because you want to share data openly

if you invent new formats, people must process them
avoid applying the security through obscurity principle inadvertently
application-specific processing should be deferred to higher layers

Case Study

Managing bibliographic data in universities is a problem

very different cultures (law vs. computer science)
different tool sets (programs, operating systems, habits)
many potential uses (yearly reports, CV, departmental web site)

ShaRef (Shared References) is trying to solve this problem

XML-based and open data model
lossless import and export for important data formats
easy integration into existing IT landscapes

XML helps in processing and exchanging data

XML data can be processed with many tools and technologies
XML can be used on many different platforms
for XML to be used, a well-defined data model is necessary

Pre-XML Data

Handcrafted syntax is hard to understand and process
Parsing requires a specialized parser
Character set issues can become very complicated

XMLized Data (Bad Idea)

XMLized Data

Structurally identical to the original BibTeX document
Can be processed with XML tools and technologies
Character set issues still unsolved (needs mapping table)

XML Data

Well-defined and well-documented data model
Reusable in different contexts
Still some open issues (e.g., no concept of author identity)

Other XML Data

Data using other schemas can be easily derived
Both formats must be understood to make the mapping
Some things may be missing (e.g., translated titles)

Is XML Self-Describing?

XML is often said to be self-describing

many people think this is the same as self-explanatory
the catch is what exactly you mean by describing

Database data cannot live without a database

database data is simply content, the structure is provided by a DBMS
XML documents have their structure encoded within them
compared to database data, XML in fact is self-describing

What is the gap between self-describing and self-explanatory?

it is impossible to find out how the document could be modified
there are no semantics associated with either structure or content
so self-describing means you can guess a lot, but you may be wrong

What is XML not Good for?

XML is Character-Based

XML is not a binary format, it is based on Unicode

binary formats cannot (or rather should not) be described using XML

Multimedia formats often are binary

image formats such as GIF, JPEG, and PNG
audio formats such as MP3 and AAC
video formats such as MPEG4 and H.264

But: Multimedia also uses many XML formats

vector graphics formats such as Scalable Vector Graphics (SVG)
Synchronized Multimedia Integration Language (SMIL) for describing presentations

XML is a Syntax for Trees

Not all data is easily represented by trees

overlapping markup (multiple views of the same content)
graph-like structures which are less constrained than trees

What is it that you have in your tree?

XML encodes a structure purely on the syntactic level
what the structures mean is in no way described by XML
XML structures must be accompanied by semantic descriptions

XML Usages

XML can be used in different ways

people should be able to use your XML directly using standard tools
if they absolutely need a set of special tools, something is wrong

XML is hip, so everybody wants to use it

many things have been created ad-hoc and without much planning
if you start something which is XML-based, use XML responsibly
if you have to use some bad XML, complain about it

Finding the balance can be hard

XML is great for prototyping and experiments
once you decide to redesign your XML, it may be too late
XML documents may be short-lived, XML schemas are definitely not

Beyond XML

Sharing Concepts

XML is a syntax for trees

trees are just data
for doing something useful, you must understand the trees' content

Schema-based sharing of concepts is possible

HTML works great because everybody is using it
anything beyond HTML's capabilities needs a new schema

General sharing of concepts is hard

the AI community tried for decades and failed
micro-formats are a more humble approach to reusable shared concepts

The Semantic Web

Technologies for describing concepts

the foundation of successful interaction is mutual understanding
describe your XML using Semantic Web technologies

XML core technologies do not convey any meaning

XML is a language for exchanging trees
XML schema languages describe what trees may be exchanged
XML schema languages are for markup design

Semantic Web technologies have received a lot of attention

and a lot of research funding
success for the most general approaches is highly questionable
the most general approaches are a proven failure, as AI's failure has demonstrated
modest approaches are much more promising and likely to succeed

Conclusions

What's the Plan?

XML Basics and how to apply them
Describing classes of XML documents
How to control the presentation of XML documents
Combining different vocabularies of XML documents
Selecting parts of an XML document
Transforming XML into something else (or XML again)
A more complicated way to describe classes of XML documents
Even more ways of describing classes of XML documents
How does all of this relate to databases?
What to expect as future developments

XML Basics
Thursday, August 31, 2006
Chapters 1.3 (pp. 16-28) & 2.1-2.4 (pp. 49-66)
W3C's XML Specification

Abstract: The Extensible Markup Language (XML) defines a simple way of structuring data. The power and popularity of XML can be explained by its versatility, its platform independence, the standards and technologies leveraging it, and the number of tools and products supporting it. Understanding XML itself is rather simple because it depends on only a very small set of other technologies. Unicode and URIs are the most important foundations of XML. XML itself specifies two different things: on the one hand the format for structured data, called XML documents, and on the other hand a constraint language for XML documents, called Document Type Definition (DTD).

Reminders

Attendance is mandatory
Course mailing list: subscribe at majordomo@sims.berkeley.edu

no subject (leave blank)
body of message: subscribe i242

Lab time needs to be renegotiated

Monday 11.00-15.00 (earlier is better)
Tuesday 9.30-12.30 (later is better)
Wednesday 9.30-12.30 (later is better)
Thursday 11.00-12.30

Foundations for XML

Identifications

Identification of Character Encodings

text can be encoded using different character sets and encodings
IANA maintains the official list of character encodings

Identification of Languages

textual content should be tagged with language information
specification based on ISO 639 language tags

Unicode

XML's Idea of Content and Names

XML documents can use a wide array of characters. They are defined by Unicode, which currently (Version 5.0) defines more than 100,000 characters (character number 100,000 was added in 2005).

XML and Unicode

XML is based on Unicode

XML is defined in terms of character structures
how these characters are encoded is not part of XML

How are XML documents encoded?

applications can use any character encoding they like
XML processors must support UTF-8 and UTF-16

How is the encoding encoded?

part of the XML document: <?xml version="1.0" encoding="UTF-8"?>
bootstrap problem solved heuristically or by out-of-band information
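
A minimal sketch of the separation between characters and encodings, using Python's standard `xml.etree.ElementTree`. The element name `city` and the sample text are made up for the example; the same characters are serialized once as UTF-8 and once as UTF-16, and the parser is expected to recover identical content from both byte sequences.

```python
import xml.etree.ElementTree as ET

TEXT = "Zürich"

# The same characters, serialized with two different encodings.
utf8_doc = f'<?xml version="1.0" encoding="UTF-8"?><city>{TEXT}</city>'.encode("utf-8")
# Python's "utf-16" codec prepends a byte order mark, which helps with the bootstrap problem.
utf16_doc = f'<?xml version="1.0" encoding="UTF-16"?><city>{TEXT}</city>'.encode("utf-16")

print(utf8_doc == utf16_doc)          # False: the bytes differ
print(ET.fromstring(utf8_doc).text)   # Zürich
print(ET.fromstring(utf16_doc).text)  # Zürich
```

Passing bytes rather than an already decoded string matters here: only then does the parser have to work out the encoding from the document itself.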

Uniform Resource Identifier (URI)

Identifiers are Essential

Uniform Resource Locator (URL) is the old concept

introduced to distinguish between locating and naming
locating and naming are two ways of identification
URLs have been replaced by URIs; technically, URLs do not exist anymore

URIs identify resources

some resources may be retrieved using a protocol: http://dret.net/netdret/
not all resource access is retrieval: mailto:dret@ischool.berkeley.edu
sometimes computers are not required: tel:+1-510-6432253
or resources cannot be located: urn:ietf:rfc:2648
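
The four example URIs share one generic syntax: a scheme, followed by a scheme-specific part. The sketch below splits them with Python's standard `urllib.parse.urlsplit`; the helper name `scheme_of` is only illustrative.

```python
from urllib.parse import urlsplit

URIS = [
    "http://dret.net/netdret/",
    "mailto:dret@ischool.berkeley.edu",
    "tel:+1-510-6432253",
    "urn:ietf:rfc:2648",
]


def scheme_of(uri):
    """Return the URI scheme, i.e., the part before the first colon."""
    return urlsplit(uri).scheme


for uri in URIS:
    parts = urlsplit(uri)
    # Only the http URI names a host (authority) from which something can be retrieved.
    print(f"{parts.scheme:7} authority={parts.netloc!r} rest={parts.path!r}")
```

Only the `http` URI carries an authority component; the other three identify their resources without pointing to a server.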

XML

XML Use Cases

XML is a metalanguage supporting application-specific vocabularies
RSS (and Atom) are XML vocabularies for newsfeeds

Doc or Die: RSS feed vs. Atom feed
browsers now incorporate newsfeed readers

OpenDocument (ODF) is a language for office application documents

designed for open and interoperable exchange
standardized by ISO (which now also standardizes Microsoft's Open XML)

Scalable Vector Graphics (SVG) for portable vector graphics

designed for embedding in Web pages
good example for compound documents: HTML containing SVG

XML Documents

Markup?

Structures are encoded using special characters

a fundamental difference when comparing to binary formats
markup languages can be read and modified using text-based tools
programs must treat markup characters in a special way

Documents are content interspersed with markup (i.e., structures)

XML-aware software interprets the markup
XML-unaware software just sees a text file

You have to pay the price for markup

Basic Concepts

XML Documents have an XML declaration (optional)
There is exactly one document element (a.k.a. root element)
Elements may be nested (there is no conceptual limit)

elements may be repeated (they can be identified by position)

Elements are marked up using tags

most elements have content, surrounded by start and end tags
empty elements are allowed and may use a special notation

Elements may have attributes (zero to any number)

attributes can only occur once on an element (i.e., they cannot be repeated)

Tree Syntax

Markup is important, but only a notation
XML documents are trees with different node types

nodes so far: document, element, attribute, text
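
To make the tree view concrete, the following sketch parses a small document with Python's standard `xml.dom.minidom` and lists the four node types named above. The sample document and the `outline` helper are assumptions made for this example. Note that the DOM does not treat attributes as children of their element, so they are listed separately here.

```python
from xml.dom import Node, minidom

DOC = '<address kind="office"><city>Berkeley</city><zip>94709</zip></address>'


def outline(node, depth=0):
    """Return one indented line per document, element, attribute, and text node."""
    indent = "  " * depth
    lines = []
    if node.nodeType == Node.DOCUMENT_NODE:
        lines.append(indent + "document")
    elif node.nodeType == Node.ELEMENT_NODE:
        lines.append(indent + f"element {node.tagName}")
        # Attributes belong to the element but are not part of childNodes.
        for name, value in node.attributes.items():
            lines.append(indent + f"  attribute {name}={value}")
    elif node.nodeType == Node.TEXT_NODE:
        lines.append(indent + f"text {node.data!r}")
    for child in node.childNodes:
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(minidom.parseString(DOC))))
```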

Elements

Elements can use a wide variety of names

Allowed: html, _, :, id9832798472, こんにちは
Disallowed: leading digits, spaces, control characters

Element names usually convey some information about the content

this is not reliable and highly language-dependent
it is very useful when working with a known vocabulary
it is potentially harmful when working with an unknown vocabulary

Elements are the foundation for XML's versatility

they can be nested (<address><city>Berkeley</city><zip>94709</zip>...)
they can be repeated (<givenname>Erik</givenname><givenname>Thomas</givenname>)
their sequence can convey additional information (given names have a sequence)
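
A short sketch of nesting, repetition, and sequence, using Python's standard `xml.etree.ElementTree`. The `person` wrapper element and the helper `is_acceptable_name` are invented for illustration; the name check simply asks the parser and is a rough test, not an implementation of the XML name grammar.

```python
import xml.etree.ElementTree as ET

PERSON = """<person>
  <givenname>Erik</givenname><givenname>Thomas</givenname>
  <address><city>Berkeley</city><zip>94709</zip></address>
</person>"""

person = ET.fromstring(PERSON)

# Repeated elements are returned in document order, so their position is meaningful.
given_names = [elem.text for elem in person.findall("givenname")]
# Nested elements are reached by following a path of element names.
city = person.findtext("address/city")
print(given_names, city)


def is_acceptable_name(name):
    """Rough check: does the parser accept an empty element with this name?"""
    try:
        ET.fromstring(f"<{name}/>")
        return True
    except ET.ParseError:
        return False


for candidate in ["html", "_", "id9832798472", "こんにちは", "9lives", "my name"]:
    print(candidate, is_acceptable_name(candidate))
```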

Attributes

Additional information pertaining to elements
Traditionally, anything that is not considered content

SGML is a document markup language
XML uses SGML's concepts
XML has its roots in the document world

Elements: Content (i.e., Data); Attributes: Metadata
Documents often distinguish the two by what is textual content

Attribute Syntax

Naming rules are the same as for elements
Attributes always appear within an element's start tag
Attributes are name/value-pairs

the value is enclosed in single or double quotes

Attribute with a single-quote value: <elem attr="Single: '"/>
Attribute with a double-quote value: <elem attr='Double :"'/>
How can attribute values contain both?

The Price for Markup

Markup characters have a special meaning

< opens a tag
within attribute values, quotes delimit the value

The literal use of a markup character requires escaping

XML's entities can refer to pieces of content
entity syntax is &name; for referring to entity name
XML has 5 predefined entities: &lt;, &gt;, &amp;, &apos;, &quot;

Attribute using both kinds of quotes: <elem attr="Single ' and Double &quot;"/>
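
Escaping works in both directions: a parser turns entity references back into characters, and a program writing XML has to escape markup characters again. The sketch below uses Python's standard `xml.etree.ElementTree` and `xml.sax.saxutils`; the element content is made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SOURCE = """<elem attr="Single ' and Double &quot;">1 &lt; 2 &amp; 3 &gt; 2</elem>"""

# Reading: the parser resolves the predefined entities.
elem = ET.fromstring(SOURCE)
print(elem.get("attr"))  # Single ' and Double "
print(elem.text)         # 1 < 2 & 3 > 2

# Writing: markup characters have to be escaped again.
print(escape("1 < 2 & 3 > 2"))              # 1 &lt; 2 &amp; 3 &gt; 2
print(quoteattr("Single ' and Double \""))  # "Single ' and Double &quot;"
```

`quoteattr` picks the quote style for you and only falls back to `&quot;` when the value contains both kinds of quotes.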

Mixed Content

The term Mixed content in XML refers to elements which have text content mixed with elements. What these elements do depends on the elements, but the important point is that they are on the same level as the text nodes of the mixed content.

Mixed Content Usage

Database people find mixed content irritating

cannot be easily mapped to relational structures
is more document-like than data-like

Document people find mixed content very intriguing

textual content can still be used as simple text
markup provides additional information for rich text
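
How an API exposes mixed content differs between libraries. As one hedged illustration, Python's standard `xml.etree.ElementTree` stores the text before the first child in `text` and the text following each child in that child's `tail`. The sample paragraph and its element names are invented for this sketch.

```python
import xml.etree.ElementTree as ET

para = ET.fromstring(
    "<p>Mixed content has <em>markup</em> inside <q>running</q> text.</p>"
)

# Text before the first child element.
print(repr(para.text))
# Each child carries its own text plus the text that follows it (its "tail").
for child in para:
    print(child.tag, repr(child.text), repr(child.tail))

# Document people can still treat the paragraph as simple text.
plain_text = "".join(para.itertext())
print(plain_text)  # Mixed content has markup inside running text.
```

The text and the elements sit side by side as children of `p`, which is exactly what makes a mapping to relational tables awkward.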

Whitespace

XML documents often are pretty-printed
Whitespace text nodes often are not really content

XML whitespace characters are space, tab, newline, and carriage return
whitespace text nodes are text nodes containing only whitespace characters

Significant Whitespace

Some whitespace text nodes are relevant
Usually text nodes in mixed content elements

Whitespace can be very important!

XML tree containing significant whitespace
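
The difference between pretty-printing whitespace and significant whitespace can be sketched with Python's standard `xml.etree.ElementTree`. Both sample documents and the helper `is_whitespace_only` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def is_whitespace_only(text):
    """True for non-empty text made up solely of XML whitespace characters."""
    return bool(text) and text.strip(" \t\n\r") == ""


# Pretty-printed data: the whitespace between elements is not really content.
pretty = ET.fromstring(
    "<address>\n  <city>Berkeley</city>\n  <zip>94709</zip>\n</address>"
)
layout_text = [pretty.text] + [child.tail for child in pretty]
print(layout_text, all(is_whitespace_only(t) for t in layout_text))

# Mixed content: the single space between the two elements is significant.
mixed = ET.fromstring(
    "<p>Whitespace can be <em>very</em> <strong>important</strong>!</p>"
)
between = mixed.find("em").tail
print(repr(between), is_whitespace_only(between))
print("".join(mixed.itertext()))  # Whitespace can be very important!
```

A processor cannot tell the two cases apart by looking at the text node alone; dropping the whitespace-only node in the second document would fuse two words.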

Processing XML

Observing XML Syntax

XML's syntax requires you to use the right characters

the grammar alone allows many XML errors
additional constraints ensure that everything is used correctly

XML processors (a.k.a. XML parsers) check for these rules

if there are problems, the document cannot be interpreted as XML
otherwise, the document is said to be well-formed

Only well-formed documents can be regarded as a tree

other documents are not XML at all, even though they may be close
XML processors must report problems to the application (no silent recovery)
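
A minimal well-formedness check, sketched with Python's standard `xml.etree.ElementTree`; the function name and the sample documents are illustrative only. The parser raises an error instead of silently repairing the input, which matches the "no silent recovery" rule.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(document):
    """Return None for a well-formed document, otherwise the parser's message."""
    try:
        ET.fromstring(document)
        return None
    except ET.ParseError as error:
        return str(error)


SAMPLES = [
    "<a><b></b></a>",      # well-formed
    "<a><b></a></b>",      # start and end tags do not match
    "<a/><b/>",            # more than one document element
    '<a x="1" x="2"/>',    # attribute repeated on one element
]

for sample in SAMPLES:
    print(sample, "->", well_formedness_error(sample) or "well-formed")
```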

Validity

Well-formed documents observe XML rules

they observe XML syntax
they observe all well-formedness constraints

Applications require the right elements and attributes
Validity is a more comprehensive concept
Valid documents observe additional rules

they must be well-formed documents
they must adhere to the constraints defined in a schema, such as a DTD

Semantics

XML is a language for encoding trees

elements and attributes are labeled nodes in this tree
the labels can be chosen freely by document authors

The tree's meaning is nothing XML is concerned with

peers must have a mutual understanding of the semantics
XML without mutual understanding is almost useless
reverse engineering often is possible, but it is risky and brittle

Conclusions

XML Documents

XML documents are structured data using markup
Elements and Attributes are the main structuring mechanisms
Elements and Attributes are names, but have no inherent semantics
For using XML successfully, shared semantics are essential

XML Document Classes

XML DTDs define classes of documents
Elements and Attributes and their usage can be defined
By validating documents, their structural correctness can be checked
DTDs solve a small part of checking XML for semantic integrity

Document Type Definition (DTD)
Tuesday, September 5, 2006
Chapter 4-4.2 (pp. 108-132)
XML QuickRef

Abstract: The XML specification defines a format for structured data (XML documents) and a grammar-based constraint language for these (DTD). In SGML-based systems, DTDs were often very complex and feature-rich constructs, which controlled a lot of the processing of SGML documents. XML greatly simplified DTDs, and the de facto usage of DTDs today has simplified them even more. In many systems today, DTDs are not used at all or are generated from sample documents. In this lecture, it is argued that DTDs (or schemas, to be more general) should be taken seriously in any non-trivial XML application, because they are a representation of the underlying (and often underspecified) data model of the application.

Schema Languages

XML Validation

XML knows two states of documents, well-formed and valid
well-formed documents satisfy all basic constraints of the XML specification

they can be parsed according to the XML grammar
they satisfy the additional constraints (e.g., start and end tags match)
together, this means they can be translated into a tree

valid documents have been validated against a DTD

a document must be well-formed before it can be validated
all elements and attributes have been defined
elements and attributes are used according to their definition
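
The two states can be told apart with a small experiment. Python's standard `xml.etree.ElementTree` is a non-validating processor, so it accepts the document below even though the document violates its own DTD; the `note` vocabulary is invented for this sketch. Checking validity would require a validating processor, which the Python standard library does not provide.

```python
import xml.etree.ElementTree as ET

# The DTD allows exactly one <to> child, but the document contains <from>.
WELL_FORMED_BUT_INVALID = """<!DOCTYPE note [
  <!ELEMENT note (to)>
  <!ELEMENT to (#PCDATA)>
]>
<note><from>Erik</from></note>"""

# A non-validating parser only checks well-formedness, so this succeeds.
root = ET.fromstring(WELL_FORMED_BUT_INVALID)
child_names = [child.tag for child in root]
print(root.tag, child_names)  # note ['from']
```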

Validation and Applications

Non-XML, Well-Formed, and Valid

DTD Example

The DTD defines constraints on element and attribute usage
The DTD only partly constrains textual content

XML Schema Languages

DTDs are part of XML itself

XML specifies the document format and one schema language
DTD support is provided by most XML processors (validating processors)

Other schema languages are available

XML Schema as the W3C's recommendation
Schematron as a rule-based alternative
various other research projects and products

Choosing appropriate schema language(s) is important

we look at DTDs because they are part of XML itself
we look at XML Schema because it is widely used
we look at Schematron because it is simple and powerful
you may even invent your own schema language (a.k.a. meta-programming)

DTD Basics

XML is SGML light

XML is a subset of SGML

XML documents have been greatly simplified
XML DTDs have retained more of SGML's peculiarities

DTD design should be left to XML experts

simple DTDs (for prototypes) are easy to define (or generate)
serious DTDs for complex data models are hard to define

XML is a useful tool for experiments and prototypes

basic knowledge of DTDs is required
serious XML schemas often use XML Schema anyway

Connecting Documents and DTDs

A DTD is a schema for a set of documents

there may be just one document for a DTD, there may be billions (HTML)
in most cases, DTDs are managed as a separate resource

The Document Type Declaration contains or points to markup declarations that provide a grammar for a class of documents

the part which is contained is called Internal Subset
the part which is pointed to is called External Subset
internal and external subset together are the Document Type Definition (DTD)

External subsets are identified by Public and System Identifiers

public identifiers use a special notation
system identifiers are URIs (relative or absolute)
applications use (i.e., know or retrieve) the DTD for validation

DTD Syntax

DTDs are not XML Documents

DTDs use a special syntax

somewhat ironic when everything else is XMLized
DTDs cannot be processed with standard XML tools
more compact than XML syntax

Definition of elements and attribute lists

elements are defined by the content they allow
attribute lists are sets of allowed attributes on elements

Syntax Rules

There is no container containing the whole DTD

<!ELEMENT xml EMPTY> thus is a complete DTD

Definitions (officially called declarations) use <!... > syntax

ELEMENT is used to define an element
ATTLIST is used to define an attribute list
ENTITY is used to define an entity

The document element is not marked explicitly

but it must be declared in the document type declaration
this means the document element is defined by the document, not by the DTD

Defining Elements

Element Only Content

Element content is defined by a grammar for the children

sequences are indicated with a comma: ,
choices are indicated with a vertical bar: |
optional parts are indicated with a question mark: ?
repeatable parts are indicated with a plus: +
optional and repeatable parts are indicated with an asterisk: *
parentheses can be used for grouping and nesting
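
Because the content model operators map directly onto regular expression operators, a content model can be checked with an ordinary regex over the sequence of child names. The following is a rough sketch under stated assumptions: the function names and the sample content model are made up for illustration, and only element-only content is handled (no #PCDATA, EMPTY, or ANY).

```python
import re

NAME = re.compile(r"[A-Za-z_][\w.\-]*")

def content_model_to_regex(model):
    """Translate an element-only content model into a regular expression
    over a sequence of child names, each followed by one space."""
    compact = re.sub(r"\s+", "", model)
    # every element name becomes a group that consumes "name "
    with_names = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + " )", compact)
    # a sequence is plain concatenation; | ? + * and () keep their meaning
    return re.compile(with_names.replace(",", ""))

def children_match(model, children):
    """Check whether a list of child element names satisfies the content model."""
    sequence = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(sequence) is not None

MODEL = "(title, author+, (abstract | summary)?, section*)"
print(children_match(MODEL, ["title", "author", "author", "section"]))
print(children_match(MODEL, ["author", "title"]))
```

A validating processor does essentially this for every element, using the declaration that matches the element's name.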

Mixed Content

allows text content and elements to be mixed

whitespace characters are allowed in element-only content (this need not be declared)
for non-whitespace characters, character data must be allowed explicitly

The allowed child elements may be constrained, but not their order or their number of occurrences
Mixed Content always is defined as <!ELEMENT x (#PCDATA | a | b | ...)* >

character only content is a special case of mixed content

the element may only contain characters (no other elements)
the repetition is not necessary because there is no choice

Empty Content

Empty elements can be useful

they may contain all information in attributes
their presence may carry semantics without the need for additional information

Defining Attribute Lists

Attributes belong to Elements

Attributes are specified in an element's Attribute List

an element definition may have any number of attributes associated with it
attributes may occur at most once on an element

Attribute definitions have a name, a type, and a default declaration

the attribute appears according to the default declaration
if the attribute is present, its value must conform to the type
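
The effect of a default declaration can be observed with Python's standard `xml.parsers.expat` module, which reads the internal subset and reports defaulted attributes even though it does not validate. This is a small sketch; the element names, attribute names, and the helper `reported_attributes` are assumptions chosen for illustration.

```python
import xml.parsers.expat

DOC = """<!DOCTYPE list [
  <!ELEMENT list (item*)>
  <!ELEMENT item (#PCDATA)>
  <!ATTLIST item kind (data|ref|object) "data"
                 note CDATA #IMPLIED>
]>
<list><item>first</item><item kind="ref" note="see above">second</item></list>"""

def reported_attributes(xml_text):
    """Collect the attributes the parser reports for each item element."""
    seen = []

    def start(name, attrs):
        if name == "item":
            seen.append(dict(attrs))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.Parse(xml_text, True)
    return seen

print(reported_attributes(DOC))
```

The first item carries no attributes in the document, yet the parser reports `kind` with its declared default; `note` has no default and stays absent.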

Attribute Types

Attribute values can be constrained (which is not possible for element content)

CDATA means any character string (but no markup)
enumerated types list allowed values: (data|ref|object) (list of XML names)
ID for identifying elements (part of ID/IDREF)
IDREF for referencing identified elements (part of ID/IDREF)

Application-oriented attribute types are often simulated

using , modeling information can be preserved

The default declaration specifies the attribute's presence

#REQUIRED means the attribute has to be specified (on every occurrence of the element)
#IMPLIED marks an optional attribute (no default is supplied; the application may imply a value)
"..." specifies a default value (and the attribute is optional)

Advanced DTDs

ID/IDREF

References in Documents

Without Validation, there are no IDs

ID is an attribute type declared in the DTD
xml:id is an attempt to support schema-independent IDs

IDs are used to assign identities to elements

the XML processor reports duplicate IDs as errors (part of validation)

IDREFs are used to reference existing IDs

the XML processor reports references to non-existing IDs as errors (part of validation)

IDs must be XML Names (in particular, they may not start with a number)
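
The two checks a validating processor performs for ID and IDREF can be imitated at the application level. The sketch below is only an illustration: without a DTD the parser does not know which attributes are IDs, so the attribute names `id` and `ref` are assumptions passed in by the caller, and the sample document is made up.

```python
import xml.etree.ElementTree as ET

def check_references(root, id_attr="id", idref_attr="ref"):
    """Report duplicate IDs and references to non-existing IDs."""
    seen, errors = set(), []
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is not None:
            if value in seen:
                errors.append("duplicate ID: " + value)
            seen.add(value)
    for elem in root.iter():
        target = elem.get(idref_attr)
        if target is not None and target not in seen:
            errors.append("reference to non-existing ID: " + target)
    return errors

doc = ET.fromstring(
    '<slides>'
    '<slide id="intro"/><slide id="idref"/><slide id="intro"/>'
    '<link ref="idref"/><link ref="missing"/>'
    '</slides>')
print(check_references(doc))
```

Note that both checks are global to the document, which mirrors the unscoped nature of ID/IDREF.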

ID/IDREF in a Document

References within the Tree

Formatting Example

XSLidy can generate links to sections such as the section about , this link is then translated into the appropriate HTML code, meaning a link with the target being a fragment identifier to the slide number.


After running XSLidy, the following HTML is generated:

XSLidy can generate links to sections such as the section about ID/IDREF, this link is then translated into the appropriate HTML code, meaning a link with the target being a fragment identifier to the slide number.


ID/IDREF Semantics

Rooted in the document world

all parts are assembled before processing
names are symbolic and assigned as required
mixed syntax and semantics

Good idea, but many shortcomings

constraints apply to one document only
IDs and IDREFs are global instead of scoped
identifiers should be allowed to use any type
identifier processing should be type-specific (2 ≟ 02)

Applications must know how to process ID/IDREF

for HTML export, links can be generated
for databases, keys should be used

Entities

General Entities

XML's core concept of physical data structures

an entity is a named unit of data which can be referenced
within documents, it is referenced by the markup &entity-name;

Entities can be used to name and reuse document content

Character References look like entities: &#x263A; or &#9786; = ☺

they can be used to represent any Unicode character, they are processed as single characters
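
A short sketch of how a parser resolves both mechanisms, using Python's standard `xml.etree.ElementTree`. The entity name `course` and its replacement text are assumptions for illustration; the two character references are the hexadecimal and decimal forms of the same Unicode character (U+263A).

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE p [
  <!ENTITY course "XML Foundations">
]>
<p>&course; &#x263A; &#9786;</p>"""

# the application sees only the expanded text, not the references
text = ET.fromstring(DOC).text
print(text)
```

After parsing, nothing in the tree reveals whether the content was written literally or through a reference.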

Parameter Entities

Parameter entities are parsed entities for use within the DTD

a parameter entity must be specifically declared as such
within DTDs, it is referenced by the markup %entity-name;
outside of DTDs, parameter entities cannot be used

Like general entities, parameter entities are meant for reuse

in a DTD, reuse is mostly about reusing structures
parameter entities are the duct tape of DTDs: not elegant, but effective

XHTML Parameter Entities (Attributes)

XHTML Parameter Entities (Content)

More Advanced DTDs

Additional Mechanisms

DTDs have more advanced mechanisms

used in few applications, mostly by SGML veterans
should not be used in new projects

Conditional Sections for configurable DTDs

parts of a DTD can be enclosed in special constructs
based on parameter entity setting, these parts can be switched on or off

External Entities for referencing external resources

parsed entities contain content parsed by the XML processor
inclusion should be done with XInclude
unparsed entities contain non-XML content (e.g., images or plain text)
referring to non-XML content is handled on the application level

Conclusions

DTD for XML Schemas

XML documents are processed by applications
Applications have assumptions about XML documents
DTDs allow some of these assumptions to be formalized as constraints
Part of the constraint checking must still be programmed

Modeling DTDs

Data models can be mapped to many different DTDs
What is a good DTD? What is a bad DTD?
How does the DTD affect further processing?

The Good, the Bad, and the Ugly
Thursday, September 7, 2006
Chapter 3-3.4 (pp. 18-25)
On XML Language Design

Abstract: While XML is rather easy to understand and use, it is also rather easy to use XML in ways that either produce ugly XML or lead to problems in components that process the XML further. This lecture therefore looks at design guidelines for XML schemas that lead to good XML. Some of the simpler topics cover basic questions of how to map a data model to XML markup (e.g., when to use elements or attributes). The next question is how data should be represented in XML so that applications can process it efficiently. We also look at which part of the markup an application actually has access to; this is defined by the XML Information Set (Infoset), the specification underlying many XML technologies.

XML Best Practices

Good: What you should do when using XML
Bad: What you should not do when using XML
Ugly¹: What you may have to do when using XML
Ugly²: XML's ugly little secret...

XML Best Practices

Markup and Schemas

XML can be encountered in different ways

as somebody having to process XML documents
as somebody having to understand XML documents
as somebody having to generate XML documents
as somebody having to design XML schemas

XML Documents

Generating XML

Character encoding

use one of XML's standard encodings (UTF-8 or UTF-16)
if you are using mostly Latin characters, UTF-8 is much more compact
any other character encoding may cause interoperability issues

Pretty-printing (adding line feeds and indentation)

pretty-printed XML is easier to read for humans
pretty-printed XML contains unnecessary whitespace
pretty-printing is good for experiments and prototypes
pretty-printing should be switched off for production systems

XML Views

Other people may use different tools

XML is a character-based format, so every character counts
other people may choose different technologies
even your XML editor may choose to see things differently

Many XML technologies use abstractions

useful for concentrating on the tree view
no full control of markup usage (automatic serialization)
think about working with a tree rather than working with a text file

XML DTDs

From Model to Markup

There should be a conceptual model of the data

formal conceptual models for XML are an active field of research
informal models may use any notation

Model design should omit questions of markup design

element/attribute decisions are not a model question
hierarchy/reference decisions are not a model question
identifying the relevant entities and their relationships is a good idea

Document engineering never invented modeling tools

for document modelers, the markup is the model
there are no established notations for modeling documents
document-type parts (e.g., mixed content) are hard to include in models

From Graphs to Trees

In the model, n:m relationships may appear

in an address database, an address should be reusable
in a résumé, an organization's information should be reusable

XML documents are trees

all non-tree structures must be represented by tree structures
in most cases, this will be done by introducing references

From Markup to Model

Start with a sample instance

start with a sample instance
generate a schema for the instance with some tool
open up the schema where necessary
try creating more example instances as different as possible/required
write code for manipulating your test set of instances

Restarting may be hard, but should be done

view the initial design as a test bed, not as the first version
after you have learned some lessons, throw everything away
restart by designing everything from scratch
content may be salvaged by writing small XSLT programs

Top-Down or Bottom-Up?

Both strategies have strengths and shortcomings

top-down tends to result in markup which looks generated
bottom-up tends to result in markup which is less consistent

Consistency is an important consideration

if you dislike attributes, avoid them wherever possible
if you like attributes, use them wherever possible
don't mix these two styles of markup design

Reuse is Good

Elements can be reused in different contexts

elements then appear in the content model of more than one element
an address may be used for employee as well as for customer

Content can be reused in different contexts

(parts of) a content model may be useful in different contexts
this only reuses an element's content, but not its name

Attributes can be reused in different contexts

technically, attributes are element-specific and have no relations when appearing on different elements
when reusing attribute names, they should represent the same concept

Reuse is Hard (in DTDs)

Element reuse simply lists the element in more than one content model
Content reuse requires parameter entities
Attribute reuse requires parameter entities
Nested parameter entities for multi-level reuse

General XML Issues

Element vs. Attribute

Elements and attributes are containers

both contain character content

Elements may carry attributes and may contain other elements

for nested structures, elements must be chosen
if the content needs to be annotated with an attribute, an element must be chosen
if the item should be repeatable, an element must be chosen

Attributes use less markup and have types

if the content is (unstructured) metadata, an attribute may be a good choice
for special types (ID/IDREF and enumerations), attributes are required
if simple markup is an issue, attributes may be preferable

Be consistent in your markup design style!

Hierarchy vs. Reference

Hierarchies are only possible with 1:n relationships

for n:m relationships, references are the only possible representation

Containment should be represented as hierarchy

containment limits the lifetime of the contained part to that of the container

Granularity

XML structures should identify the relevant information

what exactly does relevant mean?
very high granularity makes data acquisition hard
very high granularity makes data processing easy

Granularity is a general problem of data modeling

XML is simply a syntax for representing structured data
<phone>+1-510-6432253</phone>
<phone cc="1" area="510" local="6432253"/>

Bad XML

Consistent Markup

Decide on a strategy and stick to it
Inconsistent markup is hard to work with
Do not try to use markup itself for data representation

bad example: attribute values in single quotes should be ignored
bad example: empty elements using empty element tags have a special meaning

Simple Markup

XML can be read and edited by hand

this depends on the application scenario and markup design
human-accessible XML should be a markup design goal

Tool requirements

if your documents can only be used with tool xyz, something is wrong
XML should be used for open data formats in open environments

Undocumented side-effects

data models may include more dependencies than encoded in the schema
clearly document these side-effects so that users are warned
if possible, document them in a machine readable way using a schema language

Ugly XML

Redundant Data

Redundant data is bad

database design emphasizes normalization to eliminate redundant data
normalization is difficult, creates complex structures, and makes data access slower
real-life models and databases always contain redundancies

Redundant data is used very frequently

the ZIP code identifies state and city/cities
very few address databases normalize street names (or numbers)

Redundancy can be used for error detection/correction

Redundancy in the Schema

Redundant data in schemas is very bad

schema inspection cannot reveal that duplicated markup serves the same purpose
further schema development will introduce inconsistencies

Redundant data in schemas should be avoided

schemas are a small and well-designed dataset
schema design and maintenance are important issues

Generically Generated Markup

Some XML designers generate their schemas

generated schemas are less likely to be well designed
the schema generation process may be poorly implemented

Some schemas are based on a very generic markup

the structure actually is in the content, not in the markup
XML tools will not be very useful when working with these documents

XML Information Set (XML Infoset)

What is the Content of an XML Document?

An interesting (and fruitless) discussion

the content is whatever you consider it to be
agreement between peers is necessary for data exchange
agreement between specification writers and toolmakers is necessary to provide tools

DOM and XSLT were two early arrivals

both had an idea (and a model) of what the content of an XML document is
they did not have the exact same idea

Set a normative standard for an XML document's content

the Infoset defines what is represented in the tree
people should be confident to get this information when using XML technologies

What is Not in the Infoset

Do not rely on information not available in the Infoset

order of attributes
type of quotes around attribute values
notation of empty elements (<elem></elem> vs. <elem/>)
how lines are terminated
entities and character references

XML contains all this information if used as XML document
many XML technologies are in fact Infoset technologies

XML Schema, XSLT, XQuery, SOAP, ...
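
The following sketch illustrates the point with Python's standard `xml.etree.ElementTree`. Its tree model is not identical to the Infoset, so this is only an approximation of the idea; the helper `tree_view` and the sample markup are assumptions. The two documents differ in attribute order, quote style, empty-element notation, and use of a character reference, yet the application sees the same tree.

```python
import xml.etree.ElementTree as ET

def tree_view(elem):
    """Reduce an element to what a tree-based application sees."""
    return (elem.tag, dict(elem.attrib), elem.text or "",
            [tree_view(child) for child in elem])

a = ET.fromstring("<slide id='s1' lang=\"en\"><title>&#x263A;</title><note></note></slide>")
b = ET.fromstring('<slide lang="en" id="s1"><title>\u263a</title><note/></slide>')

print(tree_view(a) == tree_view(b))
```

Any application logic that depends on one of these lexical differences cannot be implemented on top of such a tree.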

Conclusions

XML and Modeling

XML is about representing structured data
XML is a format for representing trees
Data models often are not trees
Mapping data models to trees can be done in many ways

Assignment

Assignment 2 is a simple modeling task

we provide a sample instance and some requirements
create an XML version of the sample instance
create a DTD which is more versatile than just working for the sample instance

Cascading Style Sheets (CSS)
Tuesday, September 12, 2006
Chapter 5 (pp. 164-204)
W3C CSS Specs; W3C CSS Validator

Abstract: Cascading Style Sheets (CSS) were designed as a language for better separating presentation-specific issues from the structuring of documents as provided by HTML. However, CSS can be applied to XML as well, either directly (by applying a CSS stylesheet to an XML document) or as a supplement to basic HTML layout structures generated from an XML document. CSS uses a simple model of selectors and declarations. Selectors specify to which elements of a document a set of declarations (each being a value assigned to a property) applies; in addition, there is a model of how property values are inherited and cascaded. The biggest limitation of CSS is that it cannot change the structure of the displayed document.

Why CSS?

Structure vs. Layout

HTML started as very simple layout-oriented structures

more layout control was introduced as attributes (align, color)
HTML became increasingly polluted by layout information

CSS was introduced as a format for layout information

the HTML can be kept simple, containing only the structures
layout information can be reused by using separate CSS files

CSS has been designed for HTML

it has been generalized to also cover XML
the HTML heritage is still very visible in CSS

HTML vs. XML HTML vs. XML

HTML's built-in formatting rules can be expressed by CSS

CSS has been extended to cover all HTML formatting
any element can be defined to be formatted like an HTML element

What's Missing?

Restructuring content

CSS assigns formatting properties to elements
the document tree which is formatted cannot be restructured
parts can be ignored or new parts can be inserted

Interpreting content

img has a lot of special meanings attached
all form elements have very special semantics

How CSS Works

CSS in Action

HTML and CSS

CSS specifies how HTML elements are formatted

formatting can be attached to every element (redundant inside document)
formatting can be included in document (redundant across documents)
separate CSS files describe the formatting (best reuse)

Any combination of these methods is possible

XML and CSS

HTML has special elements & attributes

link and style as header elements
the style attribute on all body elements

XML has no fixed set of elements or attributes

it would have been possible to define a special CSS namespace
instead, it was decided to have a processing instruction for making the connection

The CSS then must select the elements
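
The processing instruction in question is `xml-stylesheet`, placed in the prolog of the XML document. The sketch below shows how an application could find it using Python's standard `xml.dom.minidom`; the file name `slides.css`, the sample document, and the helper name are assumptions for illustration.

```python
from xml.dom import minidom, Node

DOC = """<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="slides.css"?>
<slides><slide>Why CSS?</slide></slides>"""

def stylesheet_instructions(xml_text):
    """Return the data of all xml-stylesheet processing instructions in the prolog."""
    document = minidom.parseString(xml_text)
    return [node.data for node in document.childNodes
            if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
            and node.target == "xml-stylesheet"]

print(stylesheet_instructions(DOC))
```

The data of the processing instruction only looks like attributes; to the XML parser it is a plain string that the application has to interpret.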

Formatting Model

Properties are central to the CSS formatting model

create a document tree
identify the media type (e.g., screen or print)
retrieve all stylesheets required for the media type
assign values to all properties in the document tree
generate a formatting structure (a different tree)
render the formatting structure on the target medium

Properties control the rendering of elements
Styling in CSS means assigning values to properties

Properties

Formatting Instructions

Properties define how elements are formatted

they define a specific facet of formatting
they may have interdependencies with other properties
they can be assigned explicitly
they may be defined through cascading or inheritance

A property has a name and is used in a name/value-pair

the name identifies the property that is being set
the value space depends on the property
some properties accept complex values
sets of values: p { font : bold italic large Palatino }
sequences of values: p { font-family : "Segoe UI", verdana, helvetica, arial, sans-serif }

Property specifications can be grouped

.thinboxed { border-width : 1px ; padding : 10px ; margin : 5px }

CSS1 Properties

Factoring out HTML

CSS1 was published in 1996 and revised in 1999
HTML suffered from too many attributes

layout information was specified as CSS
style attributes in HTML were marked as deprecated

A small set of formatting features as CSS properties

font: p { font : 80% sans-serif }
color and background: body { background : url(logo.jpeg) right top }
text: h1 { text-transform : uppercase }
box: p.quote { border-style : solid dotted }
classification: img { display : none }

CSS2 Properties

CSS2

CSS2 was published in 1998 and is still being revised (as CSS 2.1)
CSS 2.1 is what you can expect from modern browsers

with IE (even IE7) being the exception

CSS2 is a single and coherent specification

CSS3 is a jungle of concurrent module development
CSS3 will never be finished (some modules will, though)

Generated Content

CSS1 had no way of adding information to the document

by using display, parts of the document could be ignored

Generated content allows content to come from the CSS

only possible with :before and :after pseudo-elements
static strings: p.abstract:before { content : "Abstract: " }
special effects like quotes: q:before { content : open-quote }
counters: h1:before { content: "Chapter " counter(chapter) ". " ; counter-increment : chapter }

Quotes can be defined as being language dependent

q:lang(en) { quotes : '"' '"' "'" "'" }
q:lang(no) { quotes : "«" "»" '"' '"' }

Tables

CSS1 did not address table formatting

table layout still had to be done using HTML attributes
a lot of redundant code specifying cell alignment and borders

CSS2 introduced tables on the CSS level

table    { display: table }
tr       { display: table-row }
thead    { display: table-header-group }
tbody    { display: table-row-group }
tfoot    { display: table-footer-group }
col      { display: table-column }
colgroup { display: table-column-group }
td, th   { display: table-cell }
caption  { display: table-caption }

Fixed vs. Automatic Table Layout

HTML defines a complex table rendering algorithm

tables are rendered incrementally
table layout is determined by looking at the complete table

[Example: the same table (three columns, two rows) rendered twice, once with automatic and once with fixed table layout]

Clipping of contents allows more freedom

HTML tables are designed to show everything
many applications work better when table contents are clipped

CSS3 Properties

CSS3

CSS3 is modularized and huge
Developments for different applications and scenarios

under construction for some time to come
implementations have to wait until the modules are more stable

CSS3 contains many powerful features

more powerful features mean a higher fall when fallback behavior occurs
CSS3 modules will probably undergo evolutionary selection and mutation

Multi-Column Layout

Most column-based layouts use tables

table columns are filled left-to-right, then top-to-bottom
multicolumns allow content to flow between columns

Multicolumn layouts are used in many Web pages
Publishing tools are good at hiding the table problems

Selectors

Select and Style

Properties are applied to elements

properties can be directly applied in an element's style attribute
in all other cases, selectors are used to select the styled elements

Selectors are good for reusable CSS code

identifying the most appropriate formatting classes is not easy
planning for CSS for a larger site is a difficult task

CSS project management should separate selectors and properties

selectors are about which things should be identified and styled
properties are about how this styling is implemented

CSS1 Selectors

CSS for Dummies

Very small set of selectors

selecting elements by name: h1 { font-size : large }
selecting elements by their id: #author { font-weight : bold }
selecting elements by their class: .abstract { font-size : small }
combining these mechanisms: p.warning { color : red }
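
How these four selector forms are matched against an element can be sketched in a few lines of Python. This is an illustrative toy and not how a browser implements selectors: the regular expression, the function name `matches`, and the sample element are assumptions, and only the forms `name`, `#id`, `.class`, and their simple combinations are supported.

```python
import re
import xml.etree.ElementTree as ET

SIMPLE = re.compile(
    r"^(?P<name>[A-Za-z][\w-]*)?(?:#(?P<id>[\w-]+))?(?:\.(?P<cls>[\w-]+))?$")

def matches(selector, elem):
    """Test one element against a simple selector such as h1, #author, or p.warning."""
    parts = SIMPLE.match(selector)
    if parts is None or not any(parts.groupdict().values()):
        raise ValueError("unsupported selector: " + selector)
    if parts["name"] and elem.tag != parts["name"]:
        return False
    if parts["id"] and elem.get("id") != parts["id"]:
        return False
    # the class attribute is a whitespace-separated list of class names
    if parts["cls"] and parts["cls"] not in elem.get("class", "").split():
        return False
    return True

p = ET.fromstring('<p class="warning big" id="author">Careful!</p>')
print([s for s in ["h1", "p", "#author", ".abstract", "p.warning"] if matches(s, p)])
```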

Pseudo-classes and -elements allow interesting effects

links (a elements) have state: a:visited and a:active
selection without markup: p:first-letter and p:first-line

CSS2 Selectors

More Selectors

CSS1 selectors are available

element name, id, class, and combinations of these

CSS2 introduced many new selectors

descendants: ul li { font-style : italic }
children: ul > li { font-style : italic }
adjacent siblings: h1 + h2 { margin-top : 0.5em }
attribute matching: h1[lang=nl] { color : orange }

CSS2 selectors are sufficient for most tasks
Setting class attributes is very important

CSS2 Pseudo Classes

CSS1's pseudo-elements are available

link states and first letter and line of content

CSS2 adds more qualifications for elements

first child: p:first-child { text-indent : 0 }
dynamic behavior: a:hover { ... } a:active { ... } a:focus { ... }
language: :lang(de) { quotes: '»' '«' '‹' '›' }
generated content: q:before { content : open-quote } q:after { content : close-quote }

Support for Internationalization (I18N) and Localization (L10N)

CSS3 Selectors

CSS goes XPath

CSS3 selectors introduce a wide array of new features

XPath is a very general selection mechanism
CSS3 re-invents some XPath features using new names
other selectors are based on dynamic information, which is more useful

Some ideas are very useful

highlighting targets: *:target { outline : red thin solid }
selection highlighting: *::selection { ... }
(form) element states: input:disabled { ... }

Adoption and demand for other selectors is unclear

attribute substrings: p[title*="hello"] { ... }
counting children: p:nth-child(42) { ... }

CSS Mechanics

Cascading

Stylesheets may have three different origins

page authors associate CSS with their pages
users configure their browser to use some CSS
user agents (browsers) have built-in CSS defining how to style content

Conflicts must be resolved using the following algorithm

find all matching declarations (matching media type and selector)
sort according to importance (browser < user < author)
declarations of the same importance are sorted by specificity (more specific selectors win)
finally, sort by order in which they were specified

!important rules can influence the algorithm

they are interpreted in step 2 (sorting by importance)
browser < user < author < author(important) < user(important)
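
The conflict resolution steps above amount to sorting by a three-part key. The sketch below is a simplification under stated assumptions: the `Declaration` class and its field names are made up, specificity is given as a ready-made tuple (ids, classes, element names) rather than computed from a selector, and matching by media type and selector is assumed to have happened already.

```python
from dataclasses import dataclass

# (origin, important) pairs, lowest priority first
PRIORITY = [("browser", False), ("user", False), ("author", False),
            ("author", True), ("user", True)]

@dataclass
class Declaration:
    value: str
    origin: str            # "browser", "user", or "author"
    specificity: tuple     # assumed form: (ids, classes, element names)
    order: int             # position in which the declaration was specified
    important: bool = False

def winning_value(declarations):
    """Pick the declaration that wins the cascade for one property of one element."""
    def key(d):
        return (PRIORITY.index((d.origin, d.important)), d.specificity, d.order)
    return max(declarations, key=key).value

candidates = [
    Declaration("black", "browser", (0, 0, 1), 0),
    Declaration("red", "author", (0, 1, 1), 1),
    Declaration("blue", "author", (0, 0, 1), 2),
    Declaration("green", "user", (0, 0, 1), 3),
]
print(winning_value(candidates))   # the more specific author rule wins
```

Adding an `!important` user declaration to the candidates would override all of these, which is the reason for the last position in the priority list.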

Inheritance

Properties often are inherited by children

setting a table's color sets the color for all contents
without inheritance, CSS stylesheets would have to be very large

Inheritance is mostly intuitive

in reality, it is a bit more complicated

specified value: what the stylesheets specify (through cascading, inheritance, or the initial value)
computed value: computed based on the environment (e.g., ex → px)
used value: converted to an absolute value (e.g., percentage widths)
actual value: specific for the environment (e.g., borders with pixel fractions)
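
The first of these steps, finding the specified value, can be sketched as a lookup that tries the cascade, then inheritance, then the initial value. Everything concrete here is an assumption for illustration: the `styles` dictionary keyed by element name stands in for the result of the cascade, and the `INHERITED` and `INITIAL` tables cover only three properties (real initial values for color and font family depend on the user agent).

```python
import xml.etree.ElementTree as ET

INHERITED = {"color", "font-family"}
INITIAL = {"color": "black", "font-family": "serif", "border-style": "none"}

def specified_value(elem, prop, styles, parents):
    """Cascade first, then inheritance, then the initial value."""
    declared = styles.get(elem.tag, {})
    if prop in declared:
        return declared[prop]
    parent = parents.get(elem)
    if prop in INHERITED and parent is not None:
        return specified_value(parent, prop, styles, parents)
    return INITIAL[prop]

doc = ET.fromstring("<table><tr><td>cell</td></tr></table>")
parents = {child: parent for parent in doc.iter() for child in parent}
styles = {"table": {"color": "navy", "border-style": "solid"}}
td = doc.find("tr/td")

print(specified_value(td, "color", styles, parents))         # inherited from table
print(specified_value(td, "border-style", styles, parents))  # not inherited
```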

Structuring Stylesheets

Stylesheets may need to be structured

importing CSS code is supported: @import "/dretnet.css" ;
modules of CSS code can be reused in different contexts

Stylesheets may be specific for a media type

braille, embossed, handheld, print, projection, screen, speech, tty, tv
specified in HTML: <link rel="stylesheet" type="text/css" media="print" href="/print.css">
specified in CSS: @media print { ... }
media-dependent import: @import "/print.css" print ;

Conclusions

CSS for Document Styling

Appropriate for HTML

Flexible selection of elements using selectors
Powerful formatting of elements using properties
Interesting interface design with pseudo-classes and -elements

Inappropriate for XML

Assigning values to properties is too simple
XML documents often need to be restructured
XSLT is the language for restructuring XML
XML → HTML+CSS is a popular Web publishing setup

XML Namespaces
Thursday, September 14, 2006
XML Namespaces FAQ (Part I)
W3C's XML Namespaces Specification

Abstract: XML is successful because it can be used in many different scenarios, and because it is easy to define a schema (such as a DTD) for a new scenario, producing a tailored XML data model for that scenario. This means that names in XML documents must be interpreted as belonging to a certain schema. As long as a document uses names from only one schema, this can be done rather easily. However, in many scenarios today documents combine names from different schemas, and XML Namespaces provide a mechanism by which the names in an XML document can be associated with a namespace.

Class Survey

How to think about Namespaces

Namespaces are Simple

XML Namespaces are often misunderstood

the biggest problem is to get rid of some assumptions
XML Namespaces are too simple and thus confusing

Instincts of Web users

URIs identify something that can be retrieved by a browser
URIs identify something that can be displayed by a browser
if I cannot get it and cannot look at it, what good can it be?

However, these assumptions are not always true

URIs identify resources which often, but not always, can be accessed over the Web
URIs identify resources which often, but not always, have a Web-accessible representation
sharing URIs means sharing an identity, which can mean sharing semantics (associated with this identity)

Simple Examples

Name Spaces

Names are one form of identification
Identification is essential for communications
Names in XML are not suitable for identification

they are local to their context (where they are defined)
if the context is uniquely identified, the names would be, too

Name Spaces: Put names into spaces

how to identify the space? Web things are identified by URIs

URI Philosophy

URIs uniquely identify resources
URIs often provide access information

pretty clear in http://dret.net/lectures/xml-fall06/
less clear in urn:ietf:rfc:2648 (RFC 2648)
very (and purposely) unclear in tag:9327493874329 (RFC 4151)

URIs often return resource representations

the resource itself is never returned (how to return a lecture?)
some representation often is useful (HTML, PDF, maybe video/audio)
the resource exists and is useful without a representation!

URIs are much more than just addresses of HTML pages

The Namespace Problem

People assume that URIs point to Web pages

a namespace name (a URI) may point to a Web page
it may also have no Web page associated with it
it may even use a URI scheme which cannot be retrieved
but it is still possible to compare URIs!

People assume some standardized content format

friendly namespaces provide HTML portals (XHTML)
some namespaces just give you the schema (SOAP)
less friendly namespaces provide minimal information (XSLT)
very unfriendly namespaces may return a 404 or even use inaccessible schemes
but they all are valid, because no resource representation is required!

Namespaces are used by comparing URIs

anything else may be useful, but is not strictly required
when searching for a namespace definition, use Google (string search)

Using Namespaces

Declaring Namespaces

Using a namespace means referencing names from it

unfortunately, there is no really standard way of writing these names
(the Clark notation is useful: {http://www.w3.org/1999/xhtml}html)
Namespaces are declared and then used

xmlns-prefixed attributes are used for declaring namespaces

Default: html xmlns="http://www.w3.org/1999/xhtml"
Prefix: xhtml:html xmlns:xhtml="http://www.w3.org/1999/xhtml"

Namespace declarations are inherited and can be overwritten

the default namespace can be undeclared
Namespace declarations can be used in a myriad of ways
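
The following sketch, again assuming Python's standard-library ElementTree, illustrates that a default declaration and a prefix declaration are two spellings of the same names, and that undeclaring the default namespace puts an element in no namespace. The prefix `x` and the `note` element are made up for the example.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

default_decl = ET.fromstring(f'<html xmlns="{XHTML}"><body/></html>')
prefix_decl = ET.fromstring(f'<x:html xmlns:x="{XHTML}"><x:body/></x:html>')
# xmlns="" undeclares the default namespace for <note> and its descendants.
undeclared = ET.fromstring(f'<html xmlns="{XHTML}"><note xmlns=""/></html>')

# Names in Clark notation, in document order.
names_default = [e.tag for e in default_decl.iter()]
names_prefix = [e.tag for e in prefix_decl.iter()]

print(names_default == names_prefix)
print(undeclared[0].tag)
```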

Unhealthy Namespace Usages

Namespaces can be (and are) used in very weird ways

these are syntax variations of identical structures
without a good (i.e., conforming) parser, interpretation is very hard
copy/paste can become hard or impossible

Namespaces can be neurotic, psychotic, borderline, or normal
Each of the insane cases complicates processing
None of these has any real technical inaccuracies
XML should be used with humans in mind

Unhealthy Namespace Usages in Practice Elements and Attributes

Namespaces often apply to elements and attributes

if an element name has no prefix, it has no namespace or the default namespace associated
if a name has a prefix, the prefix must be bound to a namespace name
names like this are called Qualified Names (QNames)

Elements and Attributes are treated differently

the default namespace only applies to unprefixed element names
unprefixed attribute names are in no namespace
XML Schema deals with this by keeping attributes local

Applications should interpret QNames

naïve implementations will break when processing unhealthy instances
the mechanics of implementing namespaces are not very hard
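
To make the element/attribute difference concrete, here is a small sketch using Python's standard-library ElementTree. The namespace names `urn:example:default` and `urn:example:prefixed` and the attribute names are hypothetical.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<a xmlns="urn:example:default" xmlns:p="urn:example:prefixed" '
    'href="x" p:role="y"/>'
)

# The unprefixed element name picks up the default namespace.
element_name = doc.tag
# The unprefixed attribute stays in no namespace; only the prefixed one is qualified.
attribute_names = sorted(doc.attrib)

print(element_name)
print(attribute_names)
```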

Other Usages

Increasingly, QNames are used in content

XSLT was the first specification using this
many other technologies have followed


Technically, everything is well-defined

for processing, the namespace bindings must be known
copy/paste on a textual basis may not work or even work wrong
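
A rough sketch of why QNames in content need the namespace bindings: the parser does not interpret attribute values, so the application has to expand the prefix itself. This assumes Python's standard-library ElementTree; the `item` element and the helper `resolve_qname` are hypothetical, the bindings are flattened into one dictionary (ignoring scoping), and how an unprefixed name in content is interpreted depends on the specification that defines the content.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document with a QName inside an attribute value.
DOC = b'<item xmlns:xs="http://www.w3.org/2001/XMLSchema" type="xs:string"/>'

def resolve_qname(qname, bindings):
    """Expand prefix:local to Clark notation using the given prefix bindings."""
    prefix, _, local = qname.rpartition(":")
    if not prefix:
        return local  # sketch: treat an unprefixed name as being in no namespace
    return "{" + bindings[prefix] + "}" + local

bindings = {}
root = None
for event, payload in ET.iterparse(io.BytesIO(DOC), events=("start-ns", "end")):
    if event == "start-ns":
        prefix, uri = payload   # one event per namespace declaration
        bindings[prefix] = uri
    else:
        root = payload          # the last "end" event is the document element

expanded = resolve_qname(root.get("type"), bindings)
print(expanded)
```

Pasting the text `xs:string` into a document without the `xs` declaration leaves the helper with no binding to look up, which is the copy/paste problem in miniature.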

Defining Namespaces Any URI is Possible

A namespace name is a URI, that's all!

it may not be accessible (because of the URI scheme)
when retrieving it, nothing may be returned
when retrieving it, something may be returned

The only important thing is the name

the name is mentioned in the documentation
if you know the documentation, you know the name
shared names mean shared knowledge

Namespace Definitions

Namespaces can be defined by a DTD (XHTML)
Namespaces can be defined by an XML Schema (SOAP)
Namespaces can be defined by RELAX NG (XHTML 2.0)
Namespaces can be defined by prose (XSLT)
If schemas are provided, additional information is required

it is unlikely that a namespace can be fully described by a schema
additional constraints and semantics are specified in prose

Structured Namespaces

Namespaces have no structure

a collection of names grouped by their namespace name
inside the namespace, names have local meaning

Namespace definitions are free to make up their own rules

but then they must also make rules how to deal with conflicts

XML Schema structures the namespace defined by a schema

the different parts of the namespace are called symbol spaces
all XML Schema components have their own symbol space
simple and complex types share the same symbol space
locally defined elements/attributes are in sub symbol spaces

Fixed or Extensible?

Can a namespace change over time?

may the namespace description become outdated? extended? replaced?
this should be clearly documented in the namespace description

The XML Namespace was widely believed to be defined by XML

xml:lang and xml:space defined by XML
xml:base was added by XML Base
xml:id was added by xml:id

When defining namespaces, plan ahead and publish everything

dependencies, change management, and versioning issues are important
there still is no accepted standard for namespace descriptions

Namespace Descriptions

Erik Wilde, Structuring Namespace Descriptions, 15th International World Wide Web Conference (WWW2006), Edinburgh, UK, May 2006.

Processing Namespaces Namespaces and Validity

Namespaces define an additional layer on top of XML

they define additional semantics (assignment to namespaces)
they define additional constraints (declaration and usage of namespaces)

Namespace-awareness is a basic requirement for XML tools

XML not compliant with XML Namespaces will break most tools
processing namespaces should be done by tools
a namespace-aware parser translates namespace declarations into nodes

Namespaces in the Document Namespaces in the Tree

Conclusions Name Spaces

Bags of Names with a URI as a label
The URI does not necessarily return anything
Namespaces can be defined in any way (e.g., schemas)
Assignment 3 asks for CSS for simple HTML

possible inspirations: CSS Zen Garden

XML Path Language (XPath) Tuesday, September 19, 2006 XPath Chapter XPath QuickRef XML structures data into a rather small number of different constructs, most notably elements and attributes. The XML Path Language (XPath) defines a way to select parts of XML documents, so that they can be used for further processing. XPath's primary use is in XSL Transformations (XSLT), but other XML technologies use it as well, e.g. XML Schema. XPath is a very compact language with a syntax that resembles the path expressions which are well-known from file systems. These path expressions, however, are generalized and therefore much more powerful than the rather simple path expressions in file systems. Because of its use in different XML technologies, XPath is one of the most important XML core technologies. Abstract

Why XPath? Selecting Parts of XML Documents

XML is a syntax for trees

it defines a way for how trees can be exchanged

XML technologies should provide for working with trees

when receiving trees, access to the tree should be easy (DOM)
validating trees should be easy (XML Schema)
mapping trees should be easy (XSLT)
XPath is like regular expressions for text-based information

Making Selection Reusable

Different XML technologies need selection

XSLT needs it for selecting parts and manipulating them
XML Schema needs it for applying identity constraints
DOM needs it for extracting parts from an XML tree
XQuery needs it for writing XML-oriented queries

XPath was created to be reusable

XML experts should only learn one selection language
this knowledge can be reused when learning new technologies
implementations can reuse code libraries

How XPath Evolved

XSL was designed as the new XML stylesheet language

XSL Transformations (XSLT) transform the input document
XSL Formatting Objects (XSL-FO) is what they will transform it to

XSLT was designed to work on arbitrary XML input documents

started as a part of XSL (WD-xsl-19981216 → WD-xslt-19990421)
the application area was XSL-FO, but not strictly limited to that
for selecting parts of the transformation input, a selection mechanism had to be provided

XPath was turned into a standalone specification

started as a part of XSLT (WD-xslt-19990421 → WD-xslt-19990709)
reused in a number of other W3C specifications (XML Schema, DOM, XQuery)

How XPath Works The XPath Tree Model Starting from the Infoset

XPath operates on an abstract data model

a tree derived from the XML Infoset
a simplification (another one!) of the underlying XML

The Infoset is turned into an XPath node tree

11 infoset item types → 7 XPath node tree node types
character items are merged into text nodes
namespace declarations are no longer visible as attributes

What is Not in the XPath Tree

The same things which are not in the Infoset

the order of attributes in a start tag
the types of quotes around attribute values
character references and entities (&#252;/&uuml; → ü)

And some more...

namespace declarations are no longer visible as attributes
notations and unexpanded entity references

XPath Evaluation Tree In / Selection Out

XPath evaluates an expression based on a Tree
Where the tree comes from is out of XPath's scope
The result of the evaluation is a selection

//img[not(@alt)] → select all images which have no alt attribute
count(//img) → return the number of images
/descendant::img[3]/@src → return the third image's src URI
starts-with(/html/@lang, 'en') → test whether the document's language is English

Syntax errors may occur
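
The four example expressions can be mimicked in Python to show the four kinds of results. This is only a sketch: the standard-library ElementTree implements a small XPath subset without functions, so plain Python stands in for `not()`, `count()` and `starts-with()`, and the input document is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document (no namespaces, to keep the sketch short).
html = ET.fromstring(
    '<html lang="en-US"><body>'
    '<img src="a.png" alt="A"/><p><img src="b.png"/></p><img src="c.png"/>'
    '</body></html>'
)

images = list(html.iter("img"))                        # //img, in document order
no_alt = [i for i in images if "alt" not in i.attrib]  # //img[not(@alt)]
image_count = len(images)                              # count(//img)
third_src = images[2].get("src")                       # /descendant::img[3]/@src
is_english = html.get("lang", "").startswith("en")     # starts-with(/html/@lang, 'en')

print([i.get("src") for i in no_alt], image_count, third_src, is_english)
```

The four results are a list of nodes, a number, a string, and a boolean.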

XPath Location Paths Location Path Structure

Each location path consists of Location Steps

location steps are separated by /, like path names in file systems

Similarities between XPath location paths and file systems

nodes in the XPath tree have different types
the type and number of nodes selected by one step
the direction in which each step moves
additional filters for selecting specific nodes

Differences between XPath location paths and file systems

XPaths may return other data types than nodes
XPath provides a built-in function library

XPath Node Tests File System vs. XPath Paths

File System Path:	`/`	`usr`	`/`	`local`	`/`	`apache`	`/`	`bin`	`/`
# Selected Nodes:	1	→ 1	→	1	→	1	→	1

XPath:	`/`	`html`	`/`	`body`	`/`	`table`	`/`	`thead`	`/`	`tr`
# Selected Nodes:	1	→ 1	→	1	→	6	→	4	→	12

Tests for Nodes

Name tests

testing for a particular name (elements/attributes): /html/head/title
wildcards (testing for any name): /html/head/*

Node type tests

text nodes: text()
comment nodes: comment()
any nodes: node()

Processing instruction tests

any PI: processing-instruction()
specific PI: processing-instruction("xml-stylesheet")

XPath Axes Where Do You Want to Go Today?

File system paths are one direction only

always one level down in the file system hierarchy
. and .. are clever directory shortcuts
other directions supported by tools (e.g., find)

XPath allows steps in different directions

the default direction is child
other directions are explicitly specified: descendant::a

Ancestor Axis

Ancestor-or-self Axis

Attribute Axis

Attributes are not the children of elements, but ...
... elements are their attributes' parent!

very counter-intuitive
very convenient

Attributes are always leaves in the node tree
Attribute Nodes have the attribute value as their value

Child Axis

Descendant Axis

Descendant-or-self Axis

Following Axis

Following-sibling Axis

Namespace Axis

Namespace nodes are not the children of elements, but ...
... elements are their namespaces' parent!

very counter-intuitive
very convenient

Namespace nodes are always leaves in the node tree
Namespace nodes have the namespace name (i.e., a URI) as their value
Namespace nodes exist because of namespace declarations

in the XPath node tree, only the namespace nodes are visible
the namespace declaration attributes (xmlns) are invisible
one namespace declaration potentially creates many namespace nodes

Parent Axis

Preceding Axis

Preceding-Sibling Axis

Self Axis

Putting it all Together

XPath location paths use a simple syntax

sequence of location steps, separated by /

Each location step uses a simple structure (preceding::p[@class="warning"])

an axis followed by :: (no axis uses the default axis child)
a node test
0-n predicates enclosed in []

Location paths can be abbreviated

child:: can be omitted (default axis)
attribute:: can be written as @
. is an abbreviation for self::node()
.. is an abbreviation for parent::node()
// is an abbreviation for /descendant-or-self::node()/
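
Some of the abbreviated forms can be tried with the XPath subset in Python's standard-library ElementTree (an assumption of this sketch; a full XPath processor supports far more). The sample document is made up, and the comments give the unabbreviated reading of each path.

```python
import xml.etree.ElementTree as ET

body = ET.fromstring(
    '<body><p class="warning">one</p>'
    '<div><p>two</p><p class="warning">three</p></div></body>'
)

children = body.findall("p")                       # child::p
anywhere = body.findall(".//p")                    # self::node()/descendant-or-self::node()/child::p
warnings = body.findall(".//p[@class='warning']")  # the same path with a predicate
parents = body.findall("div/p/..")                 # child::div/child::p/parent::node()

print([p.text for p in children])
print([p.text for p in anywhere])
print([p.text for p in warnings])
print([e.tag for e in parents])
```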

Predicates Location Step Filters

Predicates are filters for each location step

there can be any number of filters (0-n)
each filter is applied to each selected node individually

Each predicate is an XPath and evaluated as a boolean

the context of this evaluation is the node for which the filter is evaluated
if the result is a number, it is compared with the position() function (/descendant::a[5])

Predicates always reduce the set of selected nodes

as a corner case, the set of selected nodes does not change
predicates are used in the majority of non-trivial XPath location paths

Location Path Processing

Location paths are processed in a very simple way

start with a given context
for each location step, repeat the following steps:
based on the context and the axis, select the nodes on this axis
reduce this selection to the nodes identified by the node test
sequentially apply all filters to each of these nodes
take the remaining node set as the context for the next location step
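
The processing loop above can be sketched as a toy evaluator. This is an illustration under several assumptions, not how a real XPath engine works: it runs over ElementTree elements only, supports just the `self`, `child` and `descendant` axes, represents predicates as Python functions of a single node (so positional predicates are not covered), and makes no general guarantee about document order. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def axis_nodes(node, axis):
    """Return the element nodes reachable from `node` along `axis`."""
    if axis == "self":
        return [node]
    if axis == "child":
        return list(node)
    if axis == "descendant":
        return [d for d in node.iter() if d is not node]
    raise ValueError(f"unsupported axis: {axis}")

def evaluate(context, steps):
    """Evaluate location steps given as (axis, name test, predicates) tuples."""
    for axis, name, predicates in steps:
        selected = []
        for node in context:
            # 1. select the nodes on the axis, 2. reduce them by the node test
            candidates = [n for n in axis_nodes(node, axis) if name in ("*", n.tag)]
            # 3. apply all filters in sequence
            for predicate in predicates:
                candidates = [n for n in candidates if predicate(n)]
            for n in candidates:
                if not any(n is s for s in selected):  # a node set has no duplicates
                    selected.append(n)
        # 4. the remaining nodes are the context for the next step
        context = selected
    return context

root = ET.fromstring(
    '<html><body><p class="warning">a</p>'
    '<div><p>b</p><p class="warning">c</p></div></body></html>'
)
# Rough equivalent of child::body/descendant::p[@class="warning"] with <html> as context.
steps = [
    ("child", "body", []),
    ("descendant", "p", [lambda n: n.get("class") == "warning"]),
]
hits = evaluate([root], steps)
print([p.text for p in hits])
```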

XPath Expressions Beyond Location Paths

XPath is a full expression language

any evaluated expression in XSLT is an XPath
XPath must be able to operate on non-XML data types

XPath uses a very simple data model

node sets: //img[not(@alt)]
number: count(//img)
string: /descendant::img[3]/@src
boolean: starts-with(/html/@lang, 'en')

XPath Usages

XPath is used in different technologies

XSLT uses XPath as its expression language
XML Schema uses XPath for selecting identity constraint nodes
DOM uses XPath as a way to select DOM nodes

Depending on the environment, expressions must yield certain results

for conditionals, a boolean must be returned
iterations (in XSLT) only loop over nodes
when printing out text, a string must be produced

XPath has built-in rules for casting types

node set → boolean: empty is false, non-empty is true
node → string: take the string value (i.e., concatenate all text node descendants)
string → number: interpret as decimal notation (otherwise return NaN)
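XPaths often return surprising results: in //a[starts-with(@href, https)], the unquoted https is a location path selecting child elements named https, not the string 'https'; an empty node set converts to the empty string, so every a element matches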
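
The three conversion rules listed above can be written down as small Python functions. This is a sketch of the rules as stated on the slide, using ElementTree elements as stand-ins for nodes; the function names are made up, and the number pattern follows the XPath 1.0 idea of plain decimal notation (no exponents).

```python
import math
import re
import xml.etree.ElementTree as ET

def to_boolean(node_set):
    """node set -> boolean: empty is false, non-empty is true."""
    return len(node_set) > 0

def to_string(node):
    """node -> string: concatenation of all text node descendants."""
    return "".join(node.itertext())

# Optional whitespace, optional minus sign, decimal digits; nothing else.
_NUMBER = re.compile(r"\s*-?(\d+(\.\d*)?|\.\d+)\s*")

def to_number(text):
    """string -> number: decimal notation, otherwise NaN."""
    return float(text) if _NUMBER.fullmatch(text) else math.nan

p = ET.fromstring("<p>Price: <b>12</b>.50</p>")
print(to_string(p))
print(to_number("12.50"), to_number("twelve"), to_number("1e3"))
print(to_boolean([]), to_boolean([p]))
```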

XPath Functions Function Library

XPath has a small library of built-in functions

useful for basic XPath-level functions
other specs are allowed to extend it (XSLT does it)

XPath functions return results of various data types

boolean: boolean, contains, false, lang, not, starts-with, true
number: ceiling, count, floor, last, number, position, round, string-length, sum
string: concat, local-name, name, namespace-uri, normalize-space, string, substring, substring-after, substring-before, translate
node set: id

Using Functions

Functions and location paths are orthogonal

each construct may be based on the other
it is possible to nest them arbitrarily
predicates often contain functions
//a[substring(@href,string-length(@href)-2)='pdf']

XPaths can become powerful and complex

writing some code or thinking about an XPath?
XPaths are more declarative
they may be more robust against changes in the XML schema
they can be optimized by a smart XPath implementation
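
To see what the `substring`/`string-length` predicate above computes, here is a Python stand-in. It is a sketch only: `xpath_substring` is a hypothetical helper that imitates the two-argument form of XPath's 1-based `substring()` for integer positions, and the sample links are made up.

```python
import xml.etree.ElementTree as ET

def xpath_substring(text, start):
    """Two-argument substring(): characters from 1-based position `start` onward."""
    return text[max(start, 1) - 1:]

def pdf_links(root):
    """Stand-in for //a[substring(@href, string-length(@href) - 2) = 'pdf']."""
    result = []
    for a in root.iter("a"):
        href = a.get("href", "")  # a missing attribute converts to the empty string
        if xpath_substring(href, len(href) - 2) == "pdf":
            result.append(a)
    return result

doc = ET.fromstring(
    '<body><a href="slides.pdf">PDF</a><a href="index.html">HTML</a>'
    '<a href="pdf">short</a><a>no href</a></body>'
)
print([a.get("href") for a in pdf_links(doc)])
```

The predicate selects links whose last three characters are `pdf`, which is the XPath 1.0 way of writing an ends-with test.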

Limitations of XPath XPath Selects

Query languages select and recombine

look up all addresses by zip code
for each zip code, count the number of addresses

XSLT fills in the missing parts (as a programming language)

XSLT can construct XML and re-apply XPath

XQuery fills in the missing parts (query-wise)

80% of XQuery is XPath (in version 2.0, though)
the remaining 20% are bindings, constructors, and glue
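
The difference between selecting and recombining can be illustrated with the zip code example. In this sketch Python plays the role of the host language that XSLT or XQuery would normally take; the `addresses` document and its element and attribute names are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Made-up address list for the example.
addresses = ET.fromstring(
    '<addresses>'
    '<address zip="94720">South Hall</address>'
    '<address zip="94704">Shattuck Ave</address>'
    '<address zip="94720">Doe Library</address>'
    '</addresses>'
)

# Selection: the part a location path such as /addresses/address/@zip covers.
zips = [a.get("zip") for a in addresses.findall("address")]

# Recombination: grouping by distinct value and counting needs the host language.
per_zip = Counter(zips)
print(sorted(per_zip.items()))
```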

Conclusions XPath is Important

XPath is a basic tool of the XML toolbox
XPath is reused in various XML technologies
XPath selects parts of an XML document
XPath can do more general things by using expressions

XML Transformations (XSLT) — Part I Thursday, September 21, 2006 Because XML can be used to represent any vocabulary (often defined by some schema), the question is how these different vocabularies can be processed and maybe transformed into something else. This something else may be another XML vocabulary (a common requirement in B2B scenarios), or it may be HTML (a common scenario for Web publishing). Using XSL Transformations (XSLT), mapping tasks can be implemented easily. XSLT leverages XPath's expressive power in a rather simple programming language. For easy tasks, XSLT mapping can be specified without much real programming going on, by simply specifying how components of the source markup are mapped to components of the target markup. Abstract

XPath and XSLT

XPath is an expression language

location paths let you select part of an XML document tree
expressions in general may return other data types as well (string, number, boolean)

XSLT is a programming language based on XPath

XSLT defines the structures for the control flow within the program
in all the places where something is evaluated, XPaths are being used
sometimes, one can substitute for the other

XSLT Executive Summary

XSLT is an XML-oriented programming language
XSLT uses XML as its syntax
XSLT is a weakly typed language
XSLT is not designed for large programming tasks
XSLT is the standard language for XML-to-XML transformations
XSLT is very simple and often too simple
XSLT 2.0 is much more complex and powerful

XSLT as a Programming Language

XSLT is a functional programming language

fundamentally different from the usual languages
not important for very simple mapping applications
important for writing more complex transformations
hard to get used to for procedurally trained people

XSLT has built-in behavior for tree traversal

XPath allows you to select parts of the document tree
XSLT's default behavior is to traverse the complete tree
the idea of default behavior may seem strange

Simple Examples My First XSLT

XSLT uses a simple environment

all you need is an XSLT processor (Saxon recommended)

Some interesting observations

it is an XML document (using the XSLT Namespace)
it contains no visible code (no statements)
when being applied (i.e., executed), it produces a result

Why does it Work?

The text of the document is produced

technically, it is the concatenation of all text nodes
this works with all XML input documents

XSLT by default traverses the document tree

it copies all text nodes
it works its way through the document recursively
this behavior is unusual for a programming language
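
The effect of the default behavior (all text nodes concatenated, everything else dropped) can be imitated in one line of Python. This is a sketch assuming the standard-library ElementTree; it is not an XSLT processor, and the sample document is made up.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<lecture id="xslt1"><title>XSLT</title><!-- draft -->'
    '<p>Part <b>I</b> of three</p></lecture>'
)

# Walk the tree recursively and keep only the text; attributes and comments drop out.
text_output = "".join(doc.itertext())
print(text_output)
```

The missing space between "XSLT" and "Part" shows that only text nodes are copied; the markup that separated them contributes nothing.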

My Second XSLT How does it Work?

Text output rather than XML output
Overriding the default behavior

new rules for how to recurse through the document tree
the rules are applied by the XSLT processor
the execution of the XSLT code is controlled by the XSLT processor

Traversing the document tree in XSLT is easy

this is what XSLT has been designed for
trying to avoid this pattern leads to bad code and bad results

My Third XSLT How Mappings Work

All non-XSLT elements are literal result elements

their content is processed as usual
they may contain XSLT or literal result elements

XSLT elements in the stylesheet are instructions

they are executed and have some predefined behavior
if they produce results, these go to the result tree as well

One-template XSLT is a good way to start with XSLT

avoiding the learning curve associated with
for easy mapping tasks, this pattern often is sufficient
for complex tasks, this is the XSLT equivalent of spaghetti code

"Hello World" in XSLT

XSLT always transforms an XML document

this is hard-coded in the

Simply generating output is impossible

hello world therefore ignores the input
anything can be the input (including the XSLT itself)

XSLT Instructions XSLT is RISC

XSLT has a small set of instructions

the language was designed to run in a restricted environment
the language was designed for a specific task
much of the language's power lies in XPath

XPath is the CISC part of XSLT

XPath is a complex high-level language
it is specialized for the task the language is designed to do
it can be highly optimized
writing the XPaths often is the most challenging part of XSLT

Starting with XSLT should improve simple mappings

Iterations

XSLT can only iterate over node sets

any other problem has to be solved recursively
iterating over node sets often is what you want to do

Applying the same code to all of the nodes

works great if all nodes require the same processing
is of limited use when processing needs to be conditional

Conditional Instructions

Programming languages usually provide if-then-else

XSLT has an if-then: if
and an if-then-(elif-then)*-else: choose

Simple handling of special cases

having few and reasonably sized conditionals is ok
having deeply nested and very long conditionals is a problem
as in all programming languages, the latter case should use other mechanisms

My Third XSLT (II) My Third XSLT (III) Conclusions XSLT is Simple

XSLT is a simple programming language
XSLT's processing model is useful but unusual
XPath competence is essential for XSLT
Programming requires practice

XML Transformations (XSLT) — Part II Tuesday, September 26, 2006 XSLT processes documents by matching nodes in the document tree to templates, which then are executed to process these nodes. This process of matching and executing templates is the core of XSLT's processing model. XSLT has built-in templates which complement the user-supplied templates, so that the XSLT processor always finds a template to execute. Templates can conflict, and it is then necessary to resolve this conflict by finding the best match of all matching templates. This conflict resolution process also is a very important component of the XSLT processing model. Abstract

XSLT Programming

Simple mappings can be defined in one template

the template creates the result document's structure
and provide some flexibility for processing
the resulting code is always spaghetti code

Non-trivial XSLT programs use more than one template

different templates are responsible for mapping subtrees of the input document
the whole process is driven by the document
XSLT programming needs some time to get used to

Like every tool, XSLT can be misused

for simple problems, XSLT can be used like a regular programming language
for harder problems, this is impossible (missing language constructs)

XSLT Processing Model Input and Output

Templates Templates as Building Blocks

Templates are the main unit of code

the match attribute defines which nodes are processed by a template
whenever such a node needs to be processed, the template is executed (applied)
XPaths are interpreted with the matched node as context

Templates contain a mix of literal result elements and XSLT code

literal result elements and text nodes are copied to the result tree
XSLT elements are executed (depending on their semantics)
apply-templates plays a special role because it selects nodes to be processed

The template application process is special

probably the most challenging aspect when learning the language
XSLT is much easier to use when understanding the underlying principle

Basic Mechanics

The source node list contains only the root node
The result tree is created by inserting the result from processing a node from the source node list
Processing typically puts more nodes on the source node list
The process is repeated until the source node list is empty

Template Selection

Templates are connected through two statements

apply-templates selects nodes which are put on the source node list
the XSLT processor selects the best template and executes it

What happens if there is no template?

templates use patterns to specify their applicability
users may not specify a template for a node they select
instead of an error, built-in templates are used to handle this situation
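
The interplay of selecting nodes, picking a template, and falling back to a default can be sketched in a few lines of Python. This is a loose analogy rather than XSLT semantics: "templates" are Python functions keyed by element name (a stand-in for a match pattern), the output is a string instead of a result tree, and all names are hypothetical.

```python
import xml.etree.ElementTree as ET

def builtin_rule(node, templates):
    """Fallback behavior: copy text and apply templates to the child elements."""
    parts = [node.text or ""]
    for child in node:
        parts.append(apply_templates(child, templates))
        parts.append(child.tail or "")
    return "".join(parts)

def apply_templates(node, templates):
    """Pick the user template for this element name, else the fallback rule."""
    rule = templates.get(node.tag, builtin_rule)
    return rule(node, templates)

# One user-supplied "template", comparable to a template with match="em".
user_templates = {
    "em": lambda node, templates: "*" + builtin_rule(node, templates) + "*",
}

doc = ET.fromstring("<p>Hello <em>big</em> world</p>")
result = apply_templates(doc, user_templates)
print(result)
```

There is no rule for `p`, yet processing does not fail: the fallback handles it and hands `em` to the user rule.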

Patterns

Patterns are a subset of XPath

they are used to specify to which nodes certain language constructs apply
patterns specify a set of conditions on a node

The specification is short, but hard to understand

A node matches a pattern if the node is a member of the result of evaluating the pattern as an expression with respect to some possible context; the possible contexts are those whose context node is the node being matched or one of its ancestors.

Practically, patterns are node tests, node contexts, and predicates

* matches any element
tr matches tr elements
thead/tr matches tr elements within thead elements
p[@class='warning'] matches p elements with their class set to warning
these mechanisms can be combined (and connected by the union operator |)

Pattern-Based Processing Built-In Templates XSLT Default Behavior

Built into every XSLT processor

covering all seven XPath node types
the XSLT processor always finds a template to process a node

Conflicts are thus also built into the language

every user template is in conflict with a built-in template
conflict resolution is a core concept of XSLT

Root and Elements

The most important node types

every XML document has a root and at least one element

The default behavior traverses the tree recursively

the recursion only selects child nodes (the default is select="node()")
attributes are not children of the element nodes!

Text and Attributes

These nodes create text output
the processing does not continue with apply-templates

text and attribute nodes are always leaf nodes

Attributes are not selected by the built-in rules

they are only processed when selected by a user instruction

Processing Instructions and Comments

These nodes are ignored
Processing instructions and comments are selected by the built-in rules

the built-in behavior can be overwritten if required

Conflict Resolution Template Selection

XSLT processes nodes on the source node list
For processing each node, the best template must be found
XSLT supports incremental development

templates can be added for more specialized processing
other code does not have to be changed at all
the source node list provides support for this decoupling

For simple cases, the default mechanism is sufficient

advanced XSLT programming sometimes requires manual intervention

Template Selection

All templates with a match attribute

this excludes templates that only have a name attribute

All templates with the same mode

part of the apply-templates instruction selecting the node

The Pattern must match
If more than one template matches, order by import precedence

the import tree of the stylesheet is considered (this includes the built-in rules)

If more than one template matches, order by priority

this sorts rules according to the specificity

Execute resulting rule

if still more than one, signal error or execute last in stylesheet

Import Precedence

Priorities

Template priorities are computed

a very simple pattern-based process
a higher value means it is a better match

Five steps are used to compute the priority

templates using the union operator are treated as if there were multiple templates
QNames and processing-instruction tests with a literal target are assigned a priority of 0
Namespace-prefixed wildcards (prefix:*) are assigned a priority of -0.25
other node tests with axis specifiers are assigned a priority of -0.5
all other patterns are assigned a priority of 0.5

Different Conflicts Resolution Process

Pattern	Priority	Step 1	Step 2	Step 3	Step 4	Step 5	Manual Adjustment
Built-in: `text() \| @*`		✓	✓
Built-in: `* \| /`		✓	✓	✓
`*`	-0.5	✓	✓	✓	✓
`a`	0.0	✓	✓	✓	✓
`b/a`	0.5	✓	✓	✓	✓	✓
`c/b/a`	0.5	✓	✓	✓	✓	✓	`priority="1"`

Adjusting Priorities

Computed priorities always lie between -0.5 and 0.5
Non-trivial patterns almost always have the priority 0.5
Priorities can be set explicitly

template match="..." priority="1"

Managing priority values is up to the programmer

it is rarely necessary to manage a large set of competing priorities
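
As a rough illustration of how default priorities fall out of the shape of a pattern, the sketch below follows the XSLT 1.0 default priority rules for a simplified pattern syntax. It is not a pattern parser: the regular expressions and the function name are assumptions of this example, names are restricted to ASCII, and splitting on `|` would go wrong for a `|` inside a predicate.

```python
import re

NAME = r"[A-Za-z_][\w.-]*"
QNAME = rf"(?:{NAME}:)?{NAME}"
AXIS = r"(?:@|child::|attribute::)?"
PI_LITERAL = r"""processing-instruction\((['"]).*\1\)"""
NODE_TEST = r"(?:\*|(?:node|text|comment|processing-instruction)\(\))"

def default_priorities(pattern):
    """Return one default priority per alternative of a (simplified) pattern."""
    result = []
    for alternative in pattern.split("|"):  # each alternative counts as its own rule
        alt = alternative.strip()
        if re.fullmatch(rf"{AXIS}{QNAME}", alt) or re.fullmatch(PI_LITERAL, alt):
            result.append(0.0)    # a plain name, or a PI test with a literal target
        elif re.fullmatch(rf"{AXIS}{NAME}:\*", alt):
            result.append(-0.25)  # namespace wildcard such as xhtml:*
        elif re.fullmatch(rf"{AXIS}{NODE_TEST}", alt):
            result.append(-0.5)   # any other bare node test
        else:
            result.append(0.5)    # everything else: paths, predicates, ...
    return result

for p in ["*", "a", "xhtml:*", "b/a", "p[@class='warning']", "text() | @*"]:
    print(p, default_priorities(p))
```

An explicit priority attribute on the template would override whatever this computes.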

How to Iterate Processing Nodes in XSLT

XSLT supports two ways of processing nodes

loop over a set of selected nodes
process nodes which have been put on the source node list

Both mechanisms handle similar situations

a set of nodes is selected and should be processed
the code for processing has to be available in a code block
put this code in the for-each body
put this code in a reusable building block

Homogeneous Processing

may lead to less modular code
If the code has to be reused, they may not be a good solution

may provide some support for reuse

The selected nodes should require similar processing

otherwise, the iteration code will contain many conditional statements

Iterations should be restricted to small units of code

Heterogeneous Processing

If the node processing is very different, templates are better

different templates are written for all nodes being selected
no conditional code has to be written, selection is done by matching nodes to template patterns

Templates can be reused

the nodes appear in different locations and should be processed consistently
the matching mechanism provides the ideal support for this scenario

Extensible code should always use templates

other stylesheets can import an existing stylesheet
by selectively overwriting templates, the behavior can be customized

Calling Templates Executing Templates

Templates usually have a match attribute

these templates are part of XSLT's special pattern matching processing

Templates may also be named units of code

there is nothing special about these templates
they are being called using a name like regular procedures

Named Templates

template elements may also carry a name attribute
Named templates have none of the special properties of XSLT template matching

they are called by their name just like regular procedures
they do not change the context of XPath evaluation

Named templates are useful for modularizing code which is not tied to node types

in most cases, they are called using
a typical application is the implementation of a facility for printing messages

Conclusions Document-Driven Transformations

XSLT often requires document-driven programming
Imperative programmers are more used to controlling the program flow
Document-driven processing is a powerful design principle
Complex (highly variable) documents are much better handled by document-driven processing

XML Transformations (XSLT) — Part III Thursday, September 28, 2006 XSLT Parameters Advanced XSLT processing includes better control of the input and output documents, which can be finely controlled in terms of how whitespace is treated. Another interesting feature of XSLT are keys, which allow shorthand notations for frequently used access paths to nodes, and provide XSLT processors with more information for performance optimizations. Instructions for creating all possible kinds of nodes in the output tree make it possible to write code which generates element or attribute names based on runtime evaluations. Abstract

XSLT Core Concepts

XSLT can be used for very simple matching tasks

a mostly static result tree can be produced
XPaths can be used to fill in parts of the result tree
and provide some flexibility for processing

More complex transformations require a different approach

instead of static structures, nodes are individually mapped to small structures
these structure fragments together produce the result tree
the process is document-driven and based on the

Variables and Parameters Programming Language Basics

Variables in programming languages have different purposes

defining a name for something so that it can be referred to
associating this name with a value so that the value can be used
providing a way to update the variable so that its value changes

Variables in functional languages cannot change

they are immutable (often called constants in other languages)
more specifically, they are dynamic constants (i.e., can be computed at runtime)
they are defined by giving them a name and referring to them by $name

Variables in XSLT have no type (no static type checking possible)

the value that they have is typed
but a variable may have values of any type

Variables Why Variables?

Reuse of values in different locations

texts required for the transformation
facilitates better separation of structure and content


Using the correct context is essential

variables cannot be updated
if they need to be updated, they have to be re-created

Why are they called variables if they are constants?

their value varies in different invocations of the context
they are computed at runtime (dynamic constants) rather than statically

Scope and Extent

Variables can be global or local

global variables are visible in all templates
local variables are visible in their context (i.e., at following-sibling::*/descendant-or-self::*)
local variables are allowed to shadow global (not local) variables

Variable values may be assigned using the select attribute

The XPath's result is the value of the variable

Variables can contain arbitrary XSLT code

the code is executed in the same way as when constructing the result tree
the result tree fragment is the value of the variable
it can be used as a string (value-of) or as a tree (copy-of)
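The two ways of giving a variable a value can be sketched as follows. This is a minimal illustration, assuming a hypothetical source document that contains `person` elements; the variable names and the HTML output are made up for the example.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- global variable: the value is the result of the XPath in select (a string) -->
  <xsl:variable name="title" select="'Address Book'"/>

  <xsl:template match="/">
    <!-- local variable: the content is instantiated like any template body,
         and the resulting tree fragment becomes the value -->
    <xsl:variable name="heading">
      <h1><xsl:value-of select="$title"/></h1>
    </xsl:variable>
    <html>
      <head>
        <!-- used as a string: only the text of the fragment is written -->
        <title><xsl:value-of select="$heading"/></title>
      </head>
      <body>
        <!-- used as a tree: the h1 element itself is copied -->
        <xsl:copy-of select="$heading"/>
        <p><xsl:value-of select="count(//person)"/> entries</p>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```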

Using Variables Parameters Parameters vs. Variables

Parameters are variables with additional semantics

they are passed to their scope from the outside
they are available within the scope like a variable (scopes are stylesheets and templates)
like variables, they cannot be updated (and only global parameters can be shadowed)

XSLT does not check proper parameter passing

if a declared parameter is not passed, it gets a default value (specified or '')
if a passed parameter is not declared, it is ignored
like variables, parameters have no type (any value can be passed)
XSLT's robustness makes it hard to spot programming errors
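A small sketch of this lenient behavior, using a hypothetical named template `greet`: the first call passes one declared and one undeclared parameter, the second call passes nothing at all. Neither call is reported as an error.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:call-template name="greet">
      <!-- declared by the template: used -->
      <xsl:with-param name="name" select="'Erik'"/>
      <!-- not declared by the template: silently ignored -->
      <xsl:with-param name="colour" select="'blue'"/>
    </xsl:call-template>
    <!-- nothing is passed: both parameters fall back to their defaults -->
    <xsl:call-template name="greet"/>
  </xsl:template>

  <xsl:template name="greet">
    <!-- specified default value -->
    <xsl:param name="greeting" select="'Hello'"/>
    <!-- no default specified: the value is the empty string -->
    <xsl:param name="name"/>
    <xsl:value-of select="concat($greeting, ' ', $name, '&#10;')"/>
  </xsl:template>
</xsl:stylesheet>
```

A misspelled parameter name in a `with-param` would go unnoticed in the same way, which is why the signature has to be checked by hand.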

Stylesheet Parameters

Passed to the stylesheet when calling the stylesheet

the exact way of specifying the parameters depends on the processor and the environment
the passed values are available in the same way as global variables
parameter checking has to be done by hand

Template Parameters

Parameters can be passed to templates

works with apply-templates and call-template
with-param elements list the passed parameters
parameter matching is done by name (there is no particular order to parameters)

Templates can be programmed as parametrized components

checking the signature has to be done by hand

Parametrized template calls need a lot of markup

XSLT's XML syntax makes the code hard to read

main param start = 1 ; param count = 10 ; {
	loop (0) };
loop param counter ; {
	print $start + $counter ;
	if ( $counter < $count - 1) then 
		loop ($counter + 1) ; }
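For comparison, the same program written as an XSLT 1.0 stylesheet might look like the sketch below. The stylesheet parameters `start` and `count` and the named template `loop` follow the pseudocode above; the text output method and the line feed after each number are assumptions made for the example.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- stylesheet parameters with default values -->
  <xsl:param name="start" select="1"/>
  <xsl:param name="count" select="10"/>

  <!-- "main": start the recursion with counter = 0 -->
  <xsl:template match="/">
    <xsl:call-template name="loop">
      <xsl:with-param name="counter" select="0"/>
    </xsl:call-template>
  </xsl:template>

  <!-- "loop": print one number, then call itself with counter + 1 -->
  <xsl:template name="loop">
    <xsl:param name="counter"/>
    <xsl:value-of select="$start + $counter"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:if test="$counter &lt; $count - 1">
      <xsl:call-template name="loop">
        <xsl:with-param name="counter" select="$counter + 1"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

Since variables cannot be updated, the loop counter is re-created on every recursive call instead of being incremented.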

Parameter Passing Message Facility Controlling Documents XSLT Processing Model

XSLT was built as a client-side language

the browser has an XML document
the XSLT is used to transform this XML
the result is used for rendering the formatted document

XSLT provides facilities for accessing additional documents

an additional XML might contain localized texts for rendering
like everything in XSLT, identification uses URIs

Input Documents Opening Documents

Initially, XSLT starts with the XPath node tree of the main document

this step is outside of the control of the XSLT programmer

Additional documents can be accessed using document()

the function accepts URIs, which are interpreted relative to the stylesheet
only XML documents can be used; they are parsed into an XPath tree

XSLT Processors are smart enough to cache documents

re-opening the same document will not re-parse it
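A minimal sketch of accessing a second document for localized texts. The file name `texts.xml`, its `texts/text` structure, and the `lang` parameter are assumptions made for the example.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:param name="lang" select="'en'"/>

  <!-- the relative URI is resolved against the stylesheet, not the source document -->
  <xsl:variable name="texts" select="document('texts.xml')/texts"/>

  <xsl:template match="/">
    <!-- look up a text by id and language in the additional document -->
    <h1><xsl:value-of select="$texts/text[@id = 'title'][@lang = $lang]"/></h1>
  </xsl:template>
</xsl:stylesheet>
```

Binding the document to a global variable keeps the access path in one place, independent of any caching the processor does.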

Whitespace in Documents

Documents often contain many irrelevant whitespace text nodes

many XML documents are pretty-printed for readability
pretty-printing produces many line-feeds and tabs/spaces

XSLT can be instructed to ignore whitespace nodes

strip-space lists all elements for which whitespace children should be ignored
this may be a bit too much, because mixed content may contain significant whitespace

do not throw away these whitespace nodes!

XSLT can be instructed to preserve some whitespace nodes

preserve-space lists all elements for which whitespace children should be preserved
usually, preserve-space lists the exceptions for strip-space
usually, preserve-space contains a list of all mixed content elements
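A typical combination might look like this sketch, which strips whitespace-only text nodes everywhere except in two hypothetical mixed content elements, `p` and `title`.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- ignore whitespace-only text nodes in all elements ... -->
  <xsl:strip-space elements="*"/>
  <!-- ... except in these mixed content elements, where whitespace may matter -->
  <xsl:preserve-space elements="p title"/>
</xsl:stylesheet>
```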

Controlling Whitespace Output Documents Serialization

XSLT always produces a result tree

stylesheet processing starts with an empty tree (root node only)
XSLT code producing output then adds nodes to this tree
text, value-of, copy-of, copy, element, attribute, comment, processing-instruction,

Serialization is the process of externalizing the final tree

output controls how the tree is serialized
xml writes the tree as an XML document
html writes the tree as an HTML document (<img ...> instead of <img .../>)
text writes the tree's string value (the concatenation of all text nodes)
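As a small sketch, the following stylesheet builds a result tree with two empty elements and asks for HTML serialization; the `indent` and `encoding` attributes are optional and only included as an illustration.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- the result tree is the same for every method; only the serialization differs -->
  <xsl:output method="html" indent="yes" encoding="UTF-8"/>

  <xsl:template match="/">
    <html>
      <body>
        <!-- with method="html", these empty elements are written without an end tag -->
        <img src="logo.png" alt="logo"/>
        <br/>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```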

Multiple Output Documents

XSLT 1.0 does not support more than one output document

message is another output channel, but not a document
this was one of the most requested features for language improvements

How can stylesheets produce more than one document?

XSLT 1.0 may produce one document which is then post-processed
XSLT 2.0 offers language facilities for more than one output document

Keys Document Access

Some parts of documents may be accessed frequently

//person[@ss = $ss]/name/surname for getting a name by social security number
costs depend on document size and access frequency
the document structure has to be used in all places where the name is used

Keys provide access to frequently used nodes

key('ssKey', $ss)/name/surname is based on a predefined access path (the key)
very easy to optimize even for very simple XSLT processors
easier to understand from the programmer's point of view

For nested predicates, non-optimized evaluation is very expensive

//reference[@crossref = //reference[@title = $title]/@name]

Declaring and Using Keys

key defines a key on the stylesheet's top level

<xsl:key name="ssKey" match="person" use="@ss"/>
name is used for referring to the key (most people use ...Key)
match selects all nodes which will be part of the key (i.e., accessible through it)
use selects the value(s) which will retrieve the nodes

key() is used for retrieving nodes from a key

the first argument specifies the name of the key (defined by <xsl:key name="..." .../>)
the second argument specifies the value for which to look in that key
key() returns a node set (empty or any number of nodes)
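Put together, declaring and using the key from the earlier example could look like the sketch below. The `person`, `name`, and `surname` elements and the `ss` attribute follow the paths used above; passing `$ss` as a stylesheet parameter is an assumption.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:param name="ss"/>

  <!-- top-level declaration: index all person elements by their ss attribute -->
  <xsl:key name="ssKey" match="person" use="@ss"/>

  <xsl:template match="/">
    <!-- same result as //person[@ss = $ss]/name/surname, but via the key -->
    <xsl:value-of select="key('ssKey', $ss)/name/surname"/>
  </xsl:template>
</xsl:stylesheet>
```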

XML and XSLT for using a Key

key('preNameKey', 'Thomas') ≡ //name[pre = 'Thomas']

XSLT Key Structure

preNameKey
Node	Value
[1] Erik Thomas Wilde	Erik
[1] Erik Thomas Wilde	Thomas
[2] Thomas Plagemann	Thomas
[3] Bob Glushko	Bob

countryKey
Node	Value
[1a] Erik Thomas Wilde	de
[1b] iSchool/UCB	us
[2a] Thomas Plagemann	de
[2b] IFI/UIO	no
[3a] Bob Glushko	us
[3b] iSchool/UCB	us

Using Keys

Finding nodes by intersecting key() results

key() always returns node sets
interesting sets of nodes may be the intersection of several keys
unfortunately, XPath does not provide an operator for set intersection

Node Set Intersection

$a[count(. | $b) = count($b)]: Find all nodes in $a where the cardinality of $b does not change when adding this node to it. This means the node must be in $b, and it is in $a to start with.
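The intersection idiom can be applied to two key lookups as in the following sketch. The keys `personByPre` and `personByCountry` and the document structure (`person` elements with a `country` attribute and `name/pre` and `name/sur` children) are hypothetical.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:key name="personByPre" match="person" use="name/pre"/>
  <xsl:key name="personByCountry" match="person" use="@country"/>

  <xsl:template match="/">
    <xsl:variable name="a" select="key('personByPre', 'Thomas')"/>
    <xsl:variable name="b" select="key('personByCountry', 'de')"/>
    <!-- keep the nodes of $a that do not enlarge $b when added to it -->
    <xsl:for-each select="$a[count(. | $b) = count($b)]">
      <xsl:value-of select="name/sur"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```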

Generating Result Nodes Literal Result Elements

Non-XSLT elements are copied to the result tree

this is the most common way of producing nodes
in this case, the nodes' names are hard-coded in the stylesheet

Attributes are also copied to the result tree

this means the attribute will always be there
conditional creation of attributes needs other language constructs

Producing Nodes Explicitly

Element nodes can be produced by using element

the element name must be specified and can be computed
additional instructions exist for all node types
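A minimal sketch of a computed element name together with a conditionally created attribute. The source elements (`section` with a `level` attribute, an optional `id` attribute, and a `title` child) are assumptions made for the example.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="section">
    <!-- the element name is computed at runtime: h1, h2, h3, ... -->
    <xsl:element name="h{@level}">
      <!-- the attribute is only created if the source has an id -->
      <xsl:if test="@id">
        <xsl:attribute name="id">
          <xsl:value-of select="@id"/>
        </xsl:attribute>
      </xsl:if>
      <xsl:value-of select="title"/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
```

Attributes have to be added before any child nodes of the element, which is why the `xsl:if` comes first.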

Modularizing Stylesheets Including and Importing

XSLT supports two ways of modularizing code

including simply distributes code across multiple files
importing creates a dependency and a hierarchy

include is mainly used for keeping files manageable

it is used within managed projects

import is mainly used for reusing code from elsewhere

it imports reused code and assigns this code a lower precedence
local instructions can then override (if required) some of the imported code
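The difference can be sketched as follows, assuming two hypothetical files `library.xsl` (reused code from elsewhere) and `helpers.xsl` (part of the same project).

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- xsl:import must come before all other top-level elements -->
  <xsl:import href="library.xsl"/>
  <!-- included code has the same precedence as code written in this file -->
  <xsl:include href="helpers.xsl"/>

  <!-- takes precedence over any imported template for title -->
  <xsl:template match="title">
    <h1 class="custom">
      <!-- hand the node on to the imported (lower precedence) rules -->
      <xsl:apply-imports/>
    </h1>
  </xsl:template>
</xsl:stylesheet>
```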

Import Precedence

Conclusions XSLT in Practice

XSLT is a simple programming language
The processing model needs some time to get used to
Sometimes the language is really too simple
If you are really interested in XSLT, learn XSLT 2.0!

XML Schema — Part I Tuesday, October 3, 2006 Chapters 4.3 & 4.4 (pp. 132-159) XML Schema QuickRef XML Schema is the most popular schema language for XML today. It has been introduced to overcome some of the commonly observed limitations of DTDs, most notably the lack of typing. Simple Types describe content which is not structured by XML markup, which means it describes attribute values and element content. Simple types can be defined by deriving new types from existing types by using type restriction. Complex Types describe element content if this content is using attributes and/or element content other than only character data. Using XML Schema's type concepts, it is easier to represent model-level information in a schema, because type hierarchies can represent model-level specializations. Abstract

Bad Names

XML Schema is a language for describing an XML schema.
An XML schema can be defined using XML Schema.
I would like to use XML Schema for my XML schema.

The two most awkward name choices in the XML arena:

XML Schema, which is simply an XML schema language (among many others)
Open XML, which is simply an XML language for encoding office documents

Naming things means getting into people's heads

pretentious and all-embracing name choices serve a certain purpose
a name is just a name, it has no meaning
XSD and WXS are two semi-official acronyms for XML Schema

What's Wrong With DTDs?

DTDs do not support application-level datatypes

XML for B2B is very data-centric and needs typing
SGML was created for documents where typing was less important

DTDs do not support any relationships between markup constructs

content models cannot be reused
attribute lists cannot be reused
structural relationships cannot be exploited in the DTD
parameter entities are used as a hack to work around this limitation

DTD + XML Namespaces = Bad idea!

Different Levels of Semantics

XML Schema's simple datatypes provide some semantics

a formerly undescribed attribute can now be described as being an xs:date
it can be understood as being a date and inserted into a calendar
but what kind of date is it? a birthday? an order date? a shipping date?
a question of the context of where the xs:date appears

XML Schema better supports model-level information

however, XML Schema also only captures part of the application semantics
an XML Schema is usually better than a DTD, because it contains types
types provide information about the basic datatypes being used
additional semantics (e.g., different kinds of dates) must be documented elsewhere

Schema-Validation and Applications

Validation and Typing

XML Schema does two things at the same time:

Validation checks for structural integrity (is the document schema-valid?)

checking elements and attributes for proper usage (as with DTDs)
checking element contents and attribute values for proper values

Type annotations make the types available to applications

instead of having to look at the schema, applications get the Post-Schema Validation Infoset (PSVI)
type-based applications (such as XSLT 2.0) can work on the typed instance

XML Schema Types What is a Type?

A type is a set of values

the values can be enumerated (home, mobile, office)
the values can be described by extension (intervals, regular expressions)

DTDs have (almost) no types

element content is always #PCDATA (any number of any characters)
attributes most often are CDATA (any number of any characters)
attributes may have enumerated types (but no extensional types)
attributes may use

XML Schema vs. DTD

	DTD	XML Schema
Concepts	some conceptual model (formal/informal)
Types	ID/IDREF and (#P)CDATA	Hierarchy of Simple and Complex Types
Markup Constructs	Element Type Declarations <!ELEMENT order ...	Element Definitions <xs:element name="order"> ...
Instances (Documents)	<order date=""> [ order content ] </order>

Document/Data Perspectives

XML as documents is text interspersed with structure

XML captures text structures that support document processing
without these structures, the text remains usable (as unstructured text)
structure is good, but not indispensable

XML as data is structure filled with data

programmers think about classes and objects, so they need types
without structure, data-centric XML is completely useless
programmers often view XML as wire format and types as the portal to their objects

Simple Types What are Simple Types?

Simple types describe values not structured by XML markup

they describe attribute values (date="2006-10-03")
they describe element content (<phone>+1-510-6432253</phone>)

Simple types can be used for elements or attributes

XML Schema treats contents in elements and attributes equally
simple type libraries can be designed independent of their eventual use

Simple types are available in three flavors

atomic types: one value of one type (one number in some range)
union types: one value of a union of types (a number or the string undefined)
list types: a whitespace-separated list of values (phone type="home office")
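The three flavors might be declared as in this sketch. All type names are hypothetical; the schema has no target namespace so that the type references stay short.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- atomic type: one number in some range -->
  <xs:simpleType name="percentType">
    <xs:restriction base="xs:integer">
      <xs:minInclusive value="0"/>
      <xs:maxInclusive value="100"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- union type: a number or the string "undefined" -->
  <xs:simpleType name="percentOrUndefinedType">
    <xs:union memberTypes="percentType">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="undefined"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:union>
  </xs:simpleType>

  <!-- list type: whitespace-separated values such as "home office" -->
  <xs:simpleType name="phoneKindType">
    <xs:restriction base="xs:token">
      <xs:enumeration value="home"/>
      <xs:enumeration value="mobile"/>
      <xs:enumeration value="office"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="phoneKindListType">
    <xs:list itemType="phoneKindType"/>
  </xs:simpleType>
</xs:schema>
```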

Named vs. Anonymous

Types can be named or anonymous

named types have a name and can be referenced (and thus be reused)
anonymous types have no name and can only be used where they are defined

Type Definitions

Simple types are sets of values

named simple types are sets of values with a name (and thus reusable)
anonymous simple types are sets of values defined where they are needed

Simple types are defined to represent model-level information

in most cases, they will have restrictions associated with them
they may also simply be tags for semantics (fax and phone numbers share the same value space)

XML Schema has a library of built-in datatypes

ur-types are the conceptual grounding of all types
primitive types are the types that are there by definition
derived types are based on primitive types
users can derive their own types using simple type restriction

Type Hierarchy

Simple Type Restrictions Built-In Types How to Restrict

Simple types can be derived by restriction

the base type must be a simple type
the derived type will be a simple type
all simple types form a tree, rooted at anySimpleType

Restrictions are based on facets

each restriction can use 0-n facets
facets can be refined in further simple type restrictions
XML Schema designers should try to restrict types as much as possible

Facets

Facets define a certain way of restricting a simple type

facets are independent, but they may interact (minLength and maxLength)
XML Schema defines 12 constraining facets which may be used for restrictions
length, minLength, maxLength, pattern, enumeration, whiteSpace, maxInclusive, maxExclusive, minExclusive, minInclusive, totalDigits, fractionDigits

Facets may be repeated in different levels of the type hierarchy

they may only further restrict the facet (e.g., reducing the maxLength)
facets apply to all directly or indirectly derived subtypes
facets may be fixed (no further restriction is allowed)
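A sketch of a facet being refined along the type hierarchy, using two hypothetical types: `shortNameType` inherits `minLength` from `nameType`, tightens `maxLength`, and fixes it so that further derived types cannot change it.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="nameType">
    <xs:restriction base="xs:string">
      <xs:minLength value="1"/>
      <xs:maxLength value="64"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="shortNameType">
    <xs:restriction base="nameType">
      <!-- may only be reduced, never increased; fixed forbids further changes -->
      <xs:maxLength value="16" fixed="true"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```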

Not all facets are applicable to all types

the applicability depends on the primitive type being used

Facet Applicability

`string`	length, minLength, maxLength, pattern, enumeration, whiteSpace
`boolean`	pattern, whiteSpace
`float`	pattern, enumeration, whiteSpace, maxInclusive, maxExclusive, minInclusive, minExclusive
`double`	pattern, enumeration, whiteSpace, maxInclusive, maxExclusive, minInclusive, minExclusive
`decimal`	totalDigits, fractionDigits, pattern, whiteSpace, enumeration, maxInclusive, maxExclusive, minInclusive, minExclusive
`duration`	pattern, enumeration, whiteSpace, maxInclusive, maxExclusive, minInclusive, minExclusive
`dateTime`	pattern, enumeration, whiteSpace, maxInclusive, maxExclusive, minInclusive, minExclusive
`time`	pattern, enumeration, whiteSpace, maxInclusive, maxExclusive, minInclusive, minExclusive
`date`	pattern, enumeration, whiteSpace, maxInclusive, maxExclusive, minInclusive, minExclusive
`gYearMonth`	pattern, enumeration, whiteSpace, maxInclusive, maxExclusive, minInclusive, minExclusive
`gYear`	pattern, enumeration, whiteSpace, maxInclusive, maxExclusive, minInclusive, minExclusive
`gMonthDay`	pattern, enumeration, whiteSpace, maxInclusive, maxExclusive, minInclusive, minExclusive
`gDay`	pattern, enumeration, whiteSpace, maxInclusive, maxExclusive, minInclusive, minExclusive
`gMonth`	pattern, enumeration, whiteSpace, maxInclusive, maxExclusive, minInclusive, minExclusive
`hexBinary`	length, minLength, maxLength, pattern, enumeration, whiteSpace
`base64Binary`	length, minLength, maxLength, pattern, enumeration, whiteSpace
`anyURI`	length, minLength, maxLength, pattern, enumeration, whiteSpace
`QName`	length, minLength, maxLength, pattern, enumeration, whiteSpace
`NOTATION`	length, minLength, maxLength, pattern, enumeration, whiteSpace

Patterns

Patterns restrict the lexical space of simple types

most other facets restrict the value space (e.g., intervals of numbers)
in many cases, patterns are useful additions to value-oriented facets

Patterns are regular expressions

they support many common regex constructs and Unicode
the language pattern allows de, de-CH, and other tags
the pattern checks for lexical correctness, not against a code list

([a-zA-Z]{2}|[iI]-[a-zA-Z]+|[xX]-[a-zA-Z]{1,8})(-[a-zA-Z]{1,8})*
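User-defined types can use the pattern facet in the same way. The following sketch restricts a string to a phone number notation like the `+1-510-6432253` example used earlier; the type name and the exact digit counts are assumptions, not an official format.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="phoneType">
    <xs:restriction base="xs:string">
      <!-- plus sign, country code, area code, and number, separated by hyphens -->
      <xs:pattern value="\+[0-9]{1,3}-[0-9]{1,5}-[0-9]{4,10}"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:element name="phone" type="phoneType"/>
</xs:schema>
```

XML Schema patterns always match the complete value, so no anchors such as `^` and `$` are needed.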

Simple Type Examples Facet Limitations

Facets limit one dimension of a type's value space

using pattern, the lexical space can also be restricted
restrictions should be made as specific as possible
no limitations are possible beyond the predefined facets

There is no connection to the context within the document

facets cannot make references to other values (e.g., neighboring attributes)

Additional constraints should be documented

documentation enables applications to implement constraint checking
other schema languages may be used to express these constraints

Complex Types What is a Complex Type?

Complex types describe the allowed element content

they describe what the element may contain (the element's content model)
they describe the attributes that an element may have (the element's attribute list)

Complex types do not define the element name

the complex type defines which content is allowed for the element
the element definition uses the complex type to define the allowed element content

Complex types have similar properties to simple types

they can be named or anonymous
they can be used to construct a type hierarchy

Complex Type Example Complex Types & Content Types

Complex types can have different kinds of content

simple content refers to simple type content using additional attributes
complex content is anything else (anything beyond simple type content)

heavily depends on this classification

Simple Types	Complex Types
	Simple Content	Complex Content
	Simple Content	Element Only	Mixed	Empty

Content Models DTD Content Models

Content models in DTDs use a compact syntax

XML Schema supports the same facilities with a more verbose syntax
XML Schema adds features which DTDs do not support

DTDs allow elements to be mandatory, optional, repeatable, or optional and repeatable

XML Schema allows the cardinality to be specified

DTDs allow sequences (,) and alternatives (|)

XML Schema introduces a (very limited) operator for all groups

Apart from the syntax, XML Schema content models are not very different
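As a rough comparison, a hypothetical DTD content model and its XML Schema counterpart might look like this; the element names are made up for the example.

```xml
<!-- DTD equivalent: <!ELEMENT person (name, (email | phone)+, note?)> -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="person">
    <xs:complexType>
      <!-- sequence corresponds to the DTD comma operator -->
      <xs:sequence>
        <!-- mandatory: minOccurs and maxOccurs both default to 1 -->
        <xs:element name="name" type="xs:string"/>
        <!-- choice corresponds to |, repeated like the DTD + operator -->
        <xs:choice maxOccurs="unbounded">
          <xs:element name="email" type="xs:string"/>
          <xs:element name="phone" type="xs:string"/>
        </xs:choice>
        <!-- optional like the DTD ? operator; maxOccurs could be any number -->
        <xs:element name="note" type="xs:string" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```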

Mixed Content

DTDs define mixed content by mixing #PCDATA into the content model

DTDs always require mixed content to use the form ( #PCDATA | a | b )*
the occurrence of elements in mixed content cannot be controlled

XML Schema defines mixed content outside of the content model

the content model is defined like an element-only content model
the mixed attribute on the type marks the type as being mixed

XML Schema mixed content can use all model groups

it is possible to constrain element occurrences in the same way as in element-only content
in practice, this feature is rarely used (mixed content often is very loosely defined)
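A sketch of a loosely defined mixed content type, roughly corresponding to the DTD form `( #PCDATA | em | code )*`; the element names are hypothetical.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="p">
    <!-- mixed="true" allows character data between the child elements -->
    <xs:complexType mixed="true">
      <!-- the content model itself is written like an element-only model -->
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element name="em" type="xs:string"/>
        <xs:element name="code" type="xs:string"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Replacing the repeated choice with a sequence would constrain the order and number of the child elements, which a DTD cannot express for mixed content.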

Empty Content

DTDs have a special keyword for empty elements

instead of the content model, the keyword EMPTY is used
empty elements may still have attribute lists associated with them

XML Schema empty types are defined implicitly

there is no explicit keyword for defining an empty type
if a type has no model group inside it, it is empty (it still may have attributes)
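Two empty types, sketched with hypothetical element names: one without and one with an attribute.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- no model group and no attributes: the element must be empty -->
  <xs:element name="br">
    <xs:complexType/>
  </xs:element>

  <!-- no model group, but an attribute: still an empty element -->
  <xs:element name="img">
    <xs:complexType>
      <xs:attribute name="src" type="xs:anyURI" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```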

Conclusions Typed XML Structures

XML Schema introduces a type layer to schema languages
Types facilitate abstractions (and thus modeling)
Simple types can be restricted to yield more specific types
Complex types define how elements have to be used

XML Schema — Part II Thursday, October 5, 2006 XML Schema Identity Constraints XML Schema allows greater flexibility in defining constraints on intra-document references than the ID/IDREF construct of DTDs. XML Schema's Identity Constraints are scoped, typed, and can be used for elements or attributes. The second aspect of XML Schema discussed today is the derivation of complex types. Complex types can be derived by restriction or extension. Complex type restriction defines the restricted type to be a more restricted version of the base type. Complex type extension make it possible to extend the base type by either adding attributes or contents (only by appending new content to the content model). Abstract

Local and Global Definitions Named and Anonymous Types

Types can be named or anonymous

named types can be reused (for elements, attributes, or type derivation)
anonymous types can only be used where they are defined

DTD types are always anonymous (they cannot be reused)

<!ELEMENT person (name, address) >
<!ATTLIST person id ID #REQUIRED >

DTDs have everything hardcoded

complex types are always locally defined
elements are always globally defined
attributes are always locally defined

Elements Local vs. Global Elements

Elements can be defined in a type or in the schema

local elements can only be used where they are defined
global elements can be reused, they can serve as building blocks

Elements and complex types depend on each other

an element is defined by a type, often this will be a complex type
a complex type is defined by its contents, which are elements and/or attributes

Reusable Elements Attributes Attribute Definitions

DTDs treat attributes as something entirely different from element content

they are defined in an ATTLIST, not in the ELEMENT definition

<!ELEMENT person (name, address) >
<!ATTLIST person id ID #REQUIRED >

they have a special range of as opposed to elements

<!ATTLIST person id ID #REQUIRED >

XML Schema overcomes these restrictions only partially

simple types are used to define attribute (or element) contents
attributes are still described as something entirely different from an element's content model

Attributes could be better integrated into the model

treats attributes as part of an element's content model
this makes it trivial to have choices of element content and attributes

Reusing Attributes

DTDs treat attributes as something local to an element

attributes are defined in an element's ATTLIST
reusing attributes for more than one element requires

XML Schema better supports reuse of schema components

types can be defined locally (anonymous) or globally (named)
elements and attributes can be defined globally or locally

Globally defined attributes can be reused

the attribute definition does not tie it to any occurrence
the attribute can then be referenced from a complex type definition
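A minimal sketch of a globally defined attribute that is referenced from two complex types. The names are hypothetical, and the schema deliberately has no target namespace, so the attribute does not need a prefix in instances.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- global attribute: defined once, not tied to any element -->
  <xs:attribute name="id" type="xs:ID"/>

  <xs:complexType name="personType">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
    </xs:sequence>
    <!-- reference to the global attribute, required here -->
    <xs:attribute ref="id" use="required"/>
  </xs:complexType>

  <xs:complexType name="companyType">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
    </xs:sequence>
    <!-- same attribute, optional here -->
    <xs:attribute ref="id"/>
  </xs:complexType>
</xs:schema>
```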

Reusing Attributes (Example) Names and Namespaces Definitions

Many XML Schemas define a vocabulary for a namespace

DTDs do not have any support for namespaces
XML Schema heavily builds on XML Namespaces

XML Schema provides support for declaring a vocabulary's namespace

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.example.com/">

Schema-validation can check for proper namespace usage

the targetNamespace has to be used in the instance
if the namespace does not match, validation cannot succeed

Instances

The schema defines the targetNamespace of the vocabulary

all globally defined elements, attributes, and types are in that namespace
the instances must declare and use the namespace to be schema-valid

A prefixed name is not the same as a qualified name

if there is a default namespace, unprefixed elements are still qualified

Nasty details about XML Namespaces and attributes

the default namespace does not apply to attributes
attributes must therefore always be prefixed if they need to be qualified

Name Qualification

Global elements and attributes have to be used as qualified names

this means that they must be referred to by their namespace-qualified name
if a default namespace is used, elements are qualified without carrying a prefix
since the default namespace does not apply to attributes, they always must be explicitly prefixed

Local elements and attributes may be used qualified or unqualified

this control only applies to locally defined elements or attributes
the default defined by XML Schema is not a good choice
because of how XML Namespaces work, a non-default choice is recommended

XML Schema allows control over how local names have to be used

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.example.com/" elementFormDefault="qualified" attributeFormDefault="unqualified">
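A complete schema using these settings might look like the following sketch; the `person` vocabulary is an assumption made for the example.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.example.com/"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">
  <!-- global element: always in the target namespace -->
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <!-- local element: qualified because of elementFormDefault -->
        <xs:element name="name" type="xs:string"/>
      </xs:sequence>
      <!-- local attribute: unqualified because of attributeFormDefault -->
      <xs:attribute name="id" type="xs:ID"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

With these settings, an instance can declare the target namespace as default namespace and use no prefixes at all:

```xml
<person xmlns="http://www.example.com/" id="p1">
  <name>Erik Wilde</name>
</person>
```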

Identity Constraints Element = Type + Constraints

DTDs and XML Schema are mainly about specifying grammars

types describe the allowed values using grammars
grammar-oriented schemas have some nice properties

DTDs allow additional constraints

apart from the grammar definition, cross-references in the tree are supported
validation checks the integrity of the cross-references, not only the tree

The ID/IDREF mechanism of DTDs is very simple

they are always global
they also define the type of the attribute (XML names)

Improvements over ID/IDREF

XML Schema's Identity Constraints improve on the ID/IDREF mechanism of DTDs
Identity constraints are scoped

the constraint applies only to a selected set of nodes (using XPath)

Identity constraints are evaluated using typed values

IDs must be XML names (no numbers allowed)
2 ≟ +00002 should be evaluated based on the type (string or decimal?)
XML Schema separates the constraint from the type of the selected nodes

Identity constraints may select elements or attributes

XPaths are used to select the constrained values, they can select elements or attributes

Multiple fields

it is possible to select more than one field for a constraint (phone & area code must be unique)

Types of Identity Constraints

Uniqueness constraints

if there is a field, it must have a unique value among the selected nodes

Key constraints

there must be a field, and it must have a unique value among the selected nodes

Key reference constraints

the field must refer to an existing value in the referred key
if the key reference also is constrained by a key, only one reference may use the referred key

Identity Constraint Definitions

Identity constraints are part of an element definition
There are three important factors to an identity constraint

location of the identity constraint's definition
the nodes to which the constraint should be applied
the fields which are used to evaluate the constraint

If the constraint is a key reference constraint, there is a fourth factor

the key constraint that is used for checking the references
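These factors can be seen in the following sketch of a key and a key reference. The vocabulary (an `addressbook` of `person` elements with a `nr` attribute and an optional `manager` child) and the constraint names are hypothetical.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="addressbook">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="person" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="manager" type="xs:integer" minOccurs="0"/>
            </xs:sequence>
            <xs:attribute name="nr" type="xs:integer" use="required"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>

    <!-- location: the constraint is scoped to each addressbook element -->
    <xs:key name="personKey">
      <!-- nodes to which the constraint is applied -->
      <xs:selector xpath="person"/>
      <!-- field used to evaluate the constraint (an attribute) -->
      <xs:field xpath="@nr"/>
    </xs:key>

    <!-- every manager value must match the nr of some person -->
    <xs:keyref name="managerRef" refer="personKey">
      <xs:selector xpath="person"/>
      <!-- field used for the reference (an element) -->
      <xs:field xpath="manager"/>
    </xs:keyref>
  </xs:element>
</xs:schema>
```

Because both fields are typed as `xs:integer`, the values `2` and `+00002` are compared as numbers and count as equal.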

Identity Constraint Evaluation

Advanced Identity Constraints

Complex Type Derivation Type Derivation

XML Schema supports the modeling approach of specialization

simple types can be restricted to create more specialized simple types
each value of a restricted type is also a valid value of the more general type

Complex types are combinations of content and attributes
Specialization of complex types can be done in two ways

restriction: more restricted ways of using the content and/or attributes
extension: additional content and/or attributes may be used

Both kinds of complex type derivation can be regarded as specialization

restriction: for US persons the country must always be set to US
extension: people having an employee number are employees

Complex Type Restriction Removing Choices

Complex types usually allow variability

minOccurs and maxOccurs allow variability in occurrences
choice groups allow choosing among a number of alternatives
attributes may be flagged as use="optional"
simple types allow the individual values to use certain sets of values

Complex type restriction allows restrictions of all these variations

minOccurs and maxOccurs can be made more restrictive
alternatives can be removed from choice groups
optional attributes can be flagged as use="required" or use="prohibited"
the simple types of values can be set to more restricted simple types

The technical way of defining restrictions is cumbersome

when the base type changes, the restricted type has to be fixed by hand
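A sketch of why this is cumbersome: the restricted type has to repeat the complete content model of the base type. The `addressType` and `usAddressType` definitions below are hypothetical.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="addressType">
    <xs:sequence>
      <xs:element name="street" type="xs:string" minOccurs="0" maxOccurs="3"/>
      <xs:element name="city" type="xs:string"/>
      <xs:element name="country" type="xs:string"/>
    </xs:sequence>
    <xs:attribute name="kind" type="xs:string" use="optional"/>
  </xs:complexType>

  <xs:complexType name="usAddressType">
    <xs:complexContent>
      <xs:restriction base="addressType">
        <!-- the whole content model is repeated, with tighter constraints -->
        <xs:sequence>
          <!-- occurrences narrowed from 0..3 to 1..2 -->
          <xs:element name="street" type="xs:string" minOccurs="1" maxOccurs="2"/>
          <xs:element name="city" type="xs:string"/>
          <!-- the value is restricted to a single allowed value -->
          <xs:element name="country" type="xs:string" fixed="US"/>
        </xs:sequence>
        <!-- optional attribute made required -->
        <xs:attribute name="kind" type="xs:string" use="required"/>
      </xs:restriction>
    </xs:complexContent>
  </xs:complexType>
</xs:schema>
```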

Complex Type Restriction (Example) Processing Restricted Complex Types

Values of restricted types are values of the base types

type restriction is defined so that restricted type values are always base type values
code processing a type can be reused to process restricted types

If there is a well-designed type hierarchy, programming becomes easier

simple code can be written to handle the basic types
if required, more advanced code can be written for the restricted types
in many cases, restriction is more for validation than for processing

XML Schemas may even use abstract types

no element will ever use the addressType
concrete elements will only use restricted types
there can be code handling the addressType which handles all addresses

Complex Type Extension Adding Content

Complex types are element content and attributes

extensions can add content, but only at the end of the base content
extensions can add attributes (order is not significant for attributes)

Adding content to existing content may not change the existing content

if the content is element only, it has to remain element only
if the content is mixed, it has to remain mixed
if the content is empty, it may become element only or mixed
the reason for these rules is that mixed is a global property of a type

Adding attributes simply adds these to the list of existing attributes

the added attributes may be optional or required
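In contrast to restriction, an extension only lists what is added. A minimal sketch with hypothetical `personType` and `employeeType` definitions:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="personType">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="employeeType">
    <xs:complexContent>
      <xs:extension base="personType">
        <!-- appended after the content of the base type: name, employeeNumber -->
        <xs:sequence>
          <xs:element name="employeeNumber" type="xs:positiveInteger"/>
        </xs:sequence>
        <!-- added to the attribute list of the base type -->
        <xs:attribute name="department" type="xs:string" use="optional"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
</xs:schema>
```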

Complex Type Extension (Example) Processing Extended Complex Types

Values of extended types are not values of the base types

type extension adds content and/or attributes to a type
if content is added, it is always added at the end of the base type's content

If there is a well-designed type hierarchy, programming becomes easier

simple code can be written to handle the basic types
if that code should handle extended types, it must be written to handle extensions
handling extensions can be as simple as skipping them

XML Schemas may even use abstract types

no element will ever use the addressType
concrete elements will only use extended types
code handling extended types can build on code handling the base type

Conclusions Schema Components

XML Schema Features

XML Schema allows defining a grammar for XML documents
Types make it easier to turn a model into a grammar
Identity constraints enable non-grammar constraints to be expressed
Some of the things we have not seen:

named groups, modularizing schemas, wildcards, substitution groups, ...

Felix Michel ETH Zürich, Switzerland From Model to Markup Tuesday, October 10, 2006 While XML is very useful for representing and manipulating structured data, the question remains where these structures come from. They are usually some kind of encoding for a conceptual model, but there is no established and universally accepted way of how to connect the modeling world with XML markup. Some of the challenges and approaches to XML and modeling will be presented in this lecture. The goal of this lecture is to raise awareness for the current gap between models and markup, and for practical approaches how to bridge that gap. Abstract

About Me

Felix Michel
A Visiting Student Researcher from ETHZ doing his Master's Thesis @ UC
My Thesis' subject is Visualization of XML models and their mapping to schema languages
Erik Wilde is my advisor
My English is not schema-valid, nor is my pronunciation

control@brain$>: ./english-parser --validation=lax

Motivation Writing schemas is hard & tedious

Schema languages can be hard to deal with because

they are limited (DTD)
they are complex (XML Schema)

Schemas are not a good way to model data
...for practical reasons:

schemas can be confusing to look at (XML Schema)
schemas are not intelligible to non-developers

...for technical reasons:

schemas are representation- and technology-specific (they only describe XML)
XML is a tree-centric format, and so are its schema languages

Writing 5cHEMa$ is cool & g33ky!

This obviously is an engineer's view
Writing schemas should not be a goal, but a means to an end
Data modeling should be done by data modelers
A more conceptual view of the structures represented by the schemas would...

...ease understanding for non-developers
...enable focusing more on semantics
...provide a more goal-oriented approach
...allow for model integration
...lead to more platform independence

We Need a Conceptual Modeling Layer

We need to be able to do modeling on a more abstract level
In the database world, this level is called the conceptual modeling layer

Layer	Database world (SQL)	XML-related
conceptual	Entity Relationship Diagrams	???
logical	DDL (`CREATE TABLE ...`), DML (`INSERT INTO ...`)	schemas, XQuery
physical	table space	XML

Modeling Layers? Layering Models?

Modeling Layers — Layering Models? Modeling What is a Model?

A simplification

only consider some relevant / interesting traits, neglecting details / unneeded properties
the architectural model of the Parthenon in the British Museum

An abstraction

a generalization / concept / idealization:
determine / distinguish common / defining / characteristic attributes
platonic ideas

A template

a mold / blueprint / reference example:
prescribe relevant / defining attributes
παραδειγμα stone in the ancient tunnel of Eupalinos, Samos, Greece

The former has no physical embodiment, whereas the latter have
Usually there is a one-to-many relationship between models and instances

The cardinality of the relationship model-instance can be many-to-one! Think of a toy vessel — e.g., a Titanic! But wait... which one's now the model? What is a Model? (Natural Language)

Compare the use of model in (more) natural language:

Sather Tower is modeled after San Marco's Campanile
The Ford Model 'T'
A fashion model →¹
This model is a model of the Golden Gate Bridge and it took me about 300over pieces...²
Anisguetzli-Model (used for a traditional Swiss aniseed cookie)
Architectural models:³

Derek Zoolander: What is this? A center for ants? How can we be expected to teach children to learn how to read... if they can't even fit inside the building?
Mugatu: Derek, this is just a small...
Derek Zoolander: I don't wanna hear your excuses! The building has to be at least... three times bigger than this!

From http://www.markrobertwahlberg.com
From www.mocpages.com
From http://www.imdb.com/, with thanks to Dr.Sc. Erik Wilde.

Modeling

The process of identifying those relevant attributes and omitting the rest
The formulation (or translation) thereof in a way of description commonly used or even standardized (mapping to a meta-model)
This involves design decisions and trade-offs to be made

choosing the right granularity
flexibility vs. stringency

Modeling therefore always is...

...connected to a certain perspective
...limited to a certain scope
...having a main focus

In a certain field / realm / universe of discourse there usually is some agreement on how modeling has to be done. This is an essential prerequisite for models to be used as a subject of discussion / negotiation / evaluation. This agreement can have been achieved implicitly or by standardization. In the example above, Derek Zoolander does not know the conventions implicitly being agreed on when dealing with architectural models. Why modeling?

Get a bigger picture:

focus on relevant features
deal with data's meaning instead of representation
facilitate interaction / integration

Description: allows for

analysis & improvement
documentation
verification

Prescription: allows for

simulation
prognosis
making assumptions (e.g. when creating software processing the XML described by a schema)

Layering Layering in Computer Science

Encapsulating / hiding details / internals
Enabling working on a simpler / more goal-oriented level
Reusing frequent structures / procedures (patterns¹)
Gaining independence from specific technologies / media

Patterns are models that are sufficiently general, adaptable, and worthy of imitation that we can reuse them. (From: Glushko, Robert J. and McGrath, Tim: Document Engineering, p. 90). Identifying such patterns is a modeling task!

Layering in Computer Science: Protocol stacks

Encapsulation
Goal-orientation
Pattern reuse
Independence

The TCP/IP protocol stack

Google, a wiki, your blogging tool	Application	(1)
HTTP, FTP	Application	(3)
TCP, UDP	Transport	(2)
IP	Network	(4)
Ethernet, 802.11	MAC (Medium Access Control)
NRZ, DSSS, 16QAM	Physical
Copper wire, fiber, RF	Physical

Metaphor: Andrew S. Tanenbaum's interpreter

Layering in Computer Science: Compiler

Encapsulation
Goal-orientation
Pattern reuse
Independence

Compilers, Virtual Machines, VHDL

UML	(2)
Java	(4)
Virtual machine
C	(1)
Assembler Language
Machine Code
VHDL
Logic Gates	(3)

Metaphor: Big enterprise with strong vertical division of work

Layering in Human Physiology

Perception:

My Bicycle	Memory, social implications...
A Bicycle	Prior interpretative knowledge
Circles, lines and a diamond shape, colored	Form vision
Electrical pulses	Neural transmission
A concentration of Rhodopsin, Opsin, cGMP...	Retina: Rods & Cones
A bunch of sunrays, reflected	Pupil, eye lens

Motion:

Shake hands	Conscious action
...all the way down to...
Converting some ATP	Mitochondria

The Combination: Model Layers!

Modeling layers: Models of different level of abstraction and / or granularity, stacked onto each other as layers
There are different classifications of such layers of modeling
There is a diligent classification used in the context of data modeling: P.P.S. Chen's Multiple Views of Data:¹

Information concerning entities and relationships in our minds
Information structure — organization in which entities and relationships are represented by data.
Access-path-independent data structure — the data structures which are not involved with search schemes, indexing schemes, etc.
Access-path-dependent data structure

Most often the coarser classification instance / logical / conceptual layer is used

Chen, Peter Pin-Shan: The Entity-Relationship Model — Towards a Unified View of Data, Cambridge MA, 1976.

Data Modeling Who's a Data Modeler?

You are Data Modelers!
You did (implicitly, and perhaps unconsciously) Data Modeling while creating the résumé DTD
Most likely you modeled your data already when creating the XML instance
In either case you most likely were thinking in some semantic structures on a more conceptual level than the final schema or instance resides
Did anybody...

...draw some trees?
...sketch some boxes?
...write down some lists?

Quality Criteria

See the quality criteria from The Good, the Bad, and the Ugly

avoid redundancy, especially within schemas
enforce reuse
be consistent
choose a reasonable level of granularity
think in terms of the logical structure (e.g., the Infoset) rather than in terms of the physical representation (e.g., the XML as it is stored in a file)

What is good data modeling?
➟ most of the above criteria are applicable to data modeling as well

An example case: Harry again Harry returns

Make up a data model for general résumés

capture logical structure rather than representation of the data
do not duplicate information that is already encoded in the document's structure in the instance's data (rely on XML's degree of self-description)

In The Good, the Bad, and the Ugly, we have been told: think about working with a tree rather than working with a text file — so let's draw a tree!

The fact that the person's name and contact information usually is given in a head section of a résumé document does not necessarily mean that such a head section is a relevant structural element: It's just a representational convention, and therefore should not be part of the data model. When creating a view of the data, we can utilize our knowledge of appropriate representational conventions by rendering personal information in a head section.
If our vocabulary contains dedicated elements for education and experience it is not necessary to include attributes or elements specifying section titles like education or experience: This information can be retrieved from the structure and again added to a specific view when being generated.

Retrieving the résumé's structure

We omit phone and email for the sake of simplicity

A well-designed DTD (1)

The order constraint in résumé probably is introduced by the schema language's limitations

SGML: & connector (not part of XML)
XML Schema: all model group

date has been made more flexible (by denoting day to be optional) in order to be useable more generally
proficiency readily can be modeled as an attribute; its value space is a good example for an enumeration

A well-designed DTD (2)

address and name are good examples for reusable elements
name is a semantically sensible container: first etc are parts-of it
the nesting startDate (or endDate, respectively) » date semantically is a less expressive relation; it merely is inserted in order to

reuse date
insert two date elements,
giving the two some reasonable (self-descriptive) names

this clearly would be a good case for using named types in XML Schema

A good instance from a well-designed DTD A look at the essay section

There's still redundant data!
Plain, unstructured text; hard for machines to interpret
Use NLP... — or improve your data model!

Critical Review: A well-designed DTD?

The DTD is not that well-designed:

regarding the markup: container elements around education and experience items would be nice
from a data perspective: resolve redundancies
conceptually: allow for representation of these semantic relations

To achieve this, we need:

A better modeling formalism
More precise quality criteria

In the world of relational databases, both of them exist

Excursus: Data Modeling in the World of Relational Databases Quality Criteria: Normal Forms

There are well-defined quality criteria of increasing strictness, called Normal Forms¹
An informal example:

ID	Name	Study	Department
24536133	Bob	Computer Science	College of Engineering
34125004	Alice	Document Engineering	School of Information
11042019	Zlatan	Computer Science	College of Engineering

Must be resolved to:

Study	Department
Computer Science	College of Engineering
Document Engineering	School of Information

ID	Name	Study
24536133	Bob	Computer Science
34125004	Alice	Document Engineering
11042019	Zlatan	Computer Science

Yet, the most strict normal forms (4NF, 5NF) are hardly ever used in practice for the reasons mentioned earlier
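The informal example above can be replayed with Python's sqlite3 module. This is a sketch only: the table and column names (study, student) are assumptions chosen for illustration. Each department is stored once, and the original wide table is recovered with a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE study (
    name       TEXT PRIMARY KEY,
    department TEXT NOT NULL
);
CREATE TABLE student (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    study TEXT NOT NULL REFERENCES study(name)
);
""")
conn.executemany("INSERT INTO study VALUES (?, ?)", [
    ("Computer Science", "College of Engineering"),
    ("Document Engineering", "School of Information"),
])
conn.executemany("INSERT INTO student VALUES (?, ?, ?)", [
    (24536133, "Bob", "Computer Science"),
    (34125004, "Alice", "Document Engineering"),
    (11042019, "Zlatan", "Computer Science"),
])

# The join rebuilds the original four-column table without storing
# the Study -> Department fact more than once.
rows = conn.execute("""
    SELECT student.id, student.name, student.study, study.department
    FROM student JOIN study ON student.study = study.name
    ORDER BY student.name
""").fetchall()
```

Changing the department of Computer Science now touches one row instead of one row per student.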

Conceptual Modeling Formalism: Entity-Relationship Diagrams

There is a well-established (graphical!) formalism for conceptual modeling
Using the formalism in the correct way leads to data models complying with the quality criteria
An example:

Today, there exist several ER-dialects; UML is a superset

Conceptual Modeling for XML Data Is there anything similar for XML?

There is no established formalism; nor is there any formalism as suitable for XML as ER is for relational databases
There are scientific proposals: Several extensions of ER (XER, ERX, EER), formal grammars (XGrammar)

some of them have limited scope, some are impractical for real-world deployment, some of them have no graphical equivalent at all

Extended/restricted versions of UML

most often, UML is simply used as a drawing tool for schemas: Schemas are a bad way for conceptual modeling, UML is a bad drawing tool

Textual descriptions can be used

possibly inaccurate, translation to schemas is error-prone and not formally verifiable

In strongly data-oriented contexts, spreadsheets can be used (UBL)

this restricts the structural expressiveness in fundamental ways

Why is it so hard to create a suitable formalism?

XML data can be much more complex than relational data:

hierarchical structures (maybe even recursive)
mixed content
alternatives (| connector, choice model group)
order constraints
ID/IDREF constructs
faceted/enumerated content models

There are no clear quality criteria: Many things are a question of style or taste

schema languages allow for different paradigms to be used
functionally equivalent (and semantically similar) results can be achieved by quite different means: e.g. choice model group vs. substitution group
the quality of the final schema or XML may considerably depend on things beyond the scope of a conceptual model

An informal formalism

As there is no established notation, informal models may use any notation
Well then, let's draw some boxes!
But even though, let's try to do it in a systematic way:

	Determine...	Phase	Question	Example	Action
1.	Entities	Inventory	What's there?	person, company	Sketch boxes
2.	Reusable Objects	Analysis	What's there?	address, date	Perhaps include some model libraries (UBL)
3.	Reusable Tags	Markup design	What do we need?	lists, hyperlinks, headings	Perhaps include some schemas (XHTML)
4.	Relations	Assembly	What's the connection?	has-a, contains, references	Draw arcs and arrows

The résumé structure, informally formalized

This clearly is not a tree anymore
There's no specific schema language indicated
Still missing are:

constrained / faceted value spaces
order constraints
type inheritance (XML Schema)

An even better DTD An even better instance

The essay section: real mixed-content markup!

Generating Views

Utilize the (non-tree) ID/IDREF-relations in order to retrieve the data where needed:

From a well-designed data structure with rich semantic connectivity, multiple views can easily be derived
As an example, two different views have been generated using two XSLT 1.0 stylesheets

View	XSLT	HTML
Tabular	tabular.xsl	tabular.html
Textual	narrative.xsl	narrative.html
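The idea of following ID/IDREF relations to build several views can be sketched in Python instead of XSLT. The résumé markup below (organization, education, experience, and the id and at attributes) is hypothetical and not the lecture's DTD. ElementTree does not read DTD attribute types, so treating id as an ID and at as an IDREF is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical résumé instance: the organization is stored once and referenced twice.
RESUME = """<resume>
  <organization id="ucb"><name>UC Berkeley</name></organization>
  <education at="ucb"><degree>MIMS</degree></education>
  <experience at="ucb"><title>Research Assistant</title></experience>
</resume>"""

root = ET.fromstring(RESUME)

# Index every element that carries an id so references can be resolved.
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}

def organization_name(item):
    """Follow the item's reference and return the referenced organization's name."""
    return by_id[item.get("at")].findtext("name")

# Tabular view: one row per item, with the referenced data pulled in.
tabular = [(item.tag, item[0].text, organization_name(item))
           for item in root if item.get("at")]

# Narrative view: the same data rendered as running text.
narrative = "; ".join(f"{what} at {where}" for _, what, where in tabular)
```

Both views are derived from the same data, and the organization's name appears only once in the instance.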

Conclusions XML and Modeling

Conceptual modeling is highly desirable
When used in a communication-type scenario, conceptual modeling inherently relies on agreement on established formalisms
When used in a compiler-type scenario, conceptual modeling requires the availability of appropriate tools
There is no formalism really suitable for XML-centric data; there are no sophisticated tools¹
➟ There is a gap to be bridged!

Tools are evil — at least potentially: they may introduce toolchain-dependencies, they may generate ugly markup, and they may deprive us of our most beloved hobby: writing schemas.

Alternative Schema Languages — Schematron Tuesday, October 17, 2006 Chapter 4.5 (159-163) The Design of RELAX NG; Schematron While XML Schema is the most popular schema language in use today and for the foreseeable future, it is only one representative from a class of languages which are all designed for the purpose of testing whether some XML document satisfies a set of constraints. This test could of course also be conducted programmatically, but this is not portable and not easily maintainable. Schema languages thus often use a declarative approach to specifying how to conduct validation. A very simple yet very powerful language for this is Schematron, which uses the expressive power of XPath for testing whether a document satisfies a set of conditions. Schematron is rule-based in contrast to the more traditional grammar-based schema languages and complements these very well. Abstract

XML Schema Languages

XML schema languages define constraints for XML documents

defining constraints declaratively is better than writing program code
programming should be deferred as long as possible

XML schema languages validate XML documents

DTDs check XML documents against the grammar rules
XML Schemas support additional datatyping for validating contents

Applications often have many more constraints

global constraints on the characters used in the document
co-constraints which relate content to content
comparisons with external data (such as controlled lists)

Schema-Validation and Applications

Validation Pipelines

Validation is a modular task with various facets

modularization is a popular and useful principle in computer science
XML Schema is the attempt to build the one and only schema language
more modular approaches might lead to more flexible validation

Validation pipelines are useful in various scenarios

perform validation based on a sequence of basic validation tasks
make validation more configurable (partial validation)
make validation more flexible (different validation stages)

Validation pipelines can be easily implemented

programming languages support passing DOM trees as parameters
XML pipeline languages can be used to implement pipelines declaratively
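A minimal sketch of a programmatic validation pipeline, assuming each stage is a function that receives the parsed tree and returns a list of error messages. The stage names and the résumé element names are hypothetical. Choosing which stages to pass in gives partial validation, and the order of the list defines the validation stages.

```python
import xml.etree.ElementTree as ET

def check_root(tree):
    """Stage 1: the document element must be <resume>."""
    return [] if tree.tag == "resume" else [f"unexpected root <{tree.tag}>"]

def check_has_name(tree):
    """Stage 2: a <name> child must be present."""
    return [] if tree.find("name") is not None else ["missing <name>"]

def check_dates(tree):
    """Stage 3: every <date> needs a numeric year attribute."""
    errors = []
    for date in tree.iter("date"):
        year = date.get("year", "")
        if not year.isdigit():
            errors.append(f"non-numeric year: {year!r}")
    return errors

def run_pipeline(tree, stages):
    """Run stages in order and stop at the first stage that reports errors."""
    for stage in stages:
        errors = stage(tree)
        if errors:
            return stage.__name__, errors
    return None, []

doc = ET.fromstring('<resume><name>Harry</name><date year="20x6"/></resume>')
full = run_pipeline(doc, [check_root, check_has_name, check_dates])
partial = run_pipeline(doc, [check_root, check_has_name])
```

Stopping at the first failing stage is a design choice of this sketch: later stages may assume that earlier ones have passed.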

Validation Pipeline Example

RELAX NG Design by Committee

XML Schema was a political decision

several schema languages were competing to replace DTDs
DCD, DDML, SOX, XML Data, and XDR were inputs to XML Schema
XML Schema became the first unreadable W3C specification
implementing XML Schema correctly is hard (large number of specialized rules)

Researchers were looking for a more elegant solution

the underlying formalism should be well-defined and well-studied
the schema language should be easy to learn and use
lessons learned from DTDs should be included

RELAX NG is the merger of two similar approaches

Tree Regular Expressions (TREX)
Regular Language description for XML (RELAX)

RELAX NG +/-

RELAX NG and XML Schema are direct competitors
Advantages of RELAX NG

the document element is well-defined
SGML's & is supported (all is extremely limited)
non-deterministic content models

Disadvantages of RELAX NG

no datatype support (datatype libraries can be included)
no modeling facilities in the spirit of XML Schema's type derivation
less popular than XML Schema
no support for XML Schema's numeric occurrence constraints (minOccurs/maxOccurs)

Principles Validation

Validation should not change the document

there are no default values

Only schema↔instance tests are supported

there is no type hierarchy as in XML Schema (schema↔schema)
there are no identity constraints (instance↔instance)

Grammars should not be restricted

DTDs and XML Schema disallow non-determinism
RELAX NG allows non-deterministic content models
chess = white, (black, white)*, black?
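Why the chess model is non-deterministic: after a white, the next black may start another (black, white) pair or be the optional final black, and a matcher that looks only at the next element cannot tell which. The sketch below is not a RELAX NG validator. It encodes each child element as one letter (an assumption for illustration) and lets Python's backtracking re module do the matching.

```python
import re

# One letter per child element: w = white, b = black.
LETTER = {"white": "w", "black": "b"}

# chess = white, (black, white)*, black?
CHESS = re.compile(r"w(bw)*b?")

def is_valid_game(moves):
    """Check a sequence of element names against the chess content model."""
    try:
        tokens = "".join(LETTER[move] for move in moves)
    except KeyError:
        return False  # an element name that the model does not know
    # fullmatch backtracks when it guessed the wrong branch for a black.
    return CHESS.fullmatch(tokens) is not None

white_last = is_valid_game(["white", "black", "white"])
black_last = is_valid_game(["white", "black", "white", "black"])
two_whites = is_valid_game(["white", "white"])
```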

Grammars

RELAX NG grammars have a start symbol

DTDs and XML Schema do not have start symbols

Attributes are defined as part of the content model

a more homogeneous view of the XML document tree
this allows alternatives of elements and attributes

Grammars are a set of named rules

rules define how an element is composed
local definitions (nested specifications of content models) are possible

Example DTD and XML Schema RELAX NG RELAX NG Compact Syntax Document Schema Definition Languages (DSDL) Modular Validation

RELAX NG gained popularity as an XML Schema alternative

RELAX NG left useful functionality out of the language
Schematron appeared as a useful addition to schema languages

Based on the idea of modular validation, DSDL was announced

DSDL should define a set of complementary schema languages
DSDL should also define a framework for applying these languages

Development and support have been slow and disappointing

RELAX NG and Schematron are successful
all other parts of DSDL are undefined or underspecified

DSDL Master Plan

DSDL is described as having the following parts

DSDL Overview
Namespace-based Validation Dispatching Language (NVDL)
Datatype Library Language (DTLL)
Path-based integrity constraints
Character Repertoire Validation Language (CRVL)
Document Schema Renaming Language (DSRL)

It is unlikely that DSDL will succeed

years of identical presentations and stalled developments
ISO is not a good place for fast-paced technologies

DSDL should be regarded as an inspiration, not as a solution

Schematron XPath Again

Schematron popularized XPath-based testing of XML documents

the language is far from being well-designed
it can be easily used to write down a number of XPath-based constraints
it can be used as an inspiration to do a better job of XPath-based testing

XPath makes it very easy to select parts of XML trees

many XSLT programs contain some validation before processing
validation and processing should be kept separate
if validation is kept separate, there may be easier ways than XSLT

Schematron has been built for human-oriented reporting

Schematron outputs are text messages that humans should read
machine-oriented validation requires different features
integrating Schematron into machine-oriented pipelines requires some effort

Basics

Schematron schemas can be regarded as scripts for XPath testing

patterns group together a set of task-oriented tests
rules define tests which have to be applied in a certain context
assertions are XPaths which are evaluated in a given context

Schematron schemas in most cases do not cover the whole XML tree

for the rules to work, the structural integrity should be validated first
if the structure of the tree is valid, rules specify additional constraints
Schematron is a complement to grammars, not a replacement
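The pattern / rule / assertion nesting can be sketched as a tiny evaluator. This is not Schematron: the dictionary layout and the function validate are hypothetical, and tests use the small path subset that ElementTree supports instead of full XPath. A test counts as true when it selects at least one node.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-schema: pattern title -> rules -> (kind, test, message).
SCHEMA = {
    "Contact details": [
        {"context": ".//person",
         "checks": [("assert", "name", "a person must have a name"),
                    ("report", "email", "person has an email address")]},
    ],
}

def validate(root, schema):
    """Evaluate every check in the context nodes selected by its rule."""
    messages = []
    for pattern, rules in schema.items():
        for rule in rules:
            for node in root.findall(rule["context"]):
                for kind, test, message in rule["checks"]:
                    holds = bool(node.findall(test))  # relative to the context node
                    # assert outputs its message when the test is false,
                    # report outputs its message when the test is true.
                    if (kind == "assert" and not holds) or (kind == "report" and holds):
                        messages.append(f"{pattern}: {message}")
    return messages

doc = ET.fromstring(
    "<people>"
    "<person><name>Ann</name><email>ann@example.org</email></person>"
    "<person><email>bob@example.org</email></person>"
    "</people>"
)
messages = validate(doc, SCHEMA)
```

The evaluator never checks the overall structure of the tree, which is why a grammar should validate the document first.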

Implementation Performing Validation

Schema languages are declarative inputs for validation

schema languages are not executable programming languages
to perform validation, some software component must process documents and schemas

Schema languages require supporting software

DTDs are part of XML, validating XML processors must perform DTD validation
XML Schema is a separate specification, an XML Schema processor is required

Schematron is built around XPaths

any technology supporting XPath evaluation would be a good foundation
XSLT is a technology supporting XPath evaluation
XSLT's program flow control is good enough to support Schematron
XSLT processors are available for a large number of platforms

XSLT-Generated XSLT

XSLT uses XML as its syntax

this is inconvenient because XSLT programs are very verbose
processing XSLT with XSLT is supported very well
for power users, the benefits outweigh the discomforts

How is it possible to generate XSLT from XSLT?

it is impossible to use literal result elements (they would be executed)
it would be against XSLT's idea to write the resulting XSLT as text
there must be a distinction between executable and output XSLT elements

<xsl:template match="rule">
	<xsl:template match="{@context}">
		<xsl:apply-templates select="assert"/>
	</xsl:template>
</xsl:template>
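As written, the snippet above is not valid XSLT, because xsl:template may only appear at the top level of a stylesheet. In XSLT itself, the distinction between executable and output elements is usually drawn with xsl:namespace-alias or xsl:element. The sketch below shows the same compilation step in Python instead: it builds the output template as XML elements, not as text. The un-namespaced rule and assert input elements and the function compile_rule are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

XSL = "http://www.w3.org/1999/XSL/Transform"
ET.register_namespace("xsl", XSL)  # serialize with the usual xsl: prefix

def compile_rule(rule):
    """Turn one <rule context="..."> element into an xsl:template element."""
    template = ET.Element(f"{{{XSL}}}template", {"match": rule.get("context")})
    for check in rule.findall("assert"):
        # An assert outputs its text when the test is false,
        # so the generated xsl:if negates the test.
        condition = ET.SubElement(
            template, f"{{{XSL}}}if", {"test": f"not({check.get('test')})"})
        message = ET.SubElement(condition, f"{{{XSL}}}message")
        message.text = check.text
    return template

rule = ET.fromstring(
    '<rule context="ENTRY">'
    '<assert test="count(NAME) = 1">An ENTRY needs exactly one NAME.</assert>'
    '</rule>'
)
compiled = ET.tostring(compile_rule(rule), encoding="unicode")
```

The generated template is data for the generator and only becomes executable once an XSLT processor loads it.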

XSLT-Based Schematron

Compiling Assertions Compiled Example Patterns Grouping Tests

Patterns are containers for a set of rules

patterns are used for representing goal-oriented parts of the validation
achieving one goal may require checking within various contexts

Patterns are described by a title and additional text

Schematron is geared towards human users
title and text are documentation only, they are never used for validation

Patterns can be grouped by phases for different validation tasks

patterns group a set of rules specific for one validation goal
depending on the application, different validation phases may require different sets of patterns

Rules Setting the Context

Setting the context is essential for XPath expressions

within patterns, rules group context-specific assertions
assertion XPaths are evaluated relative to a rule's context

Abstract rules make it possible to reuse assertions

abstract rules are not evaluated (they do not have a context)
other rules may import assertions by extending an abstract rule

Assertions Assertions with assert

assert is used to specify assertions

if the XPath evaluates to false, the assertion's content is output
assertions are always evaluated as boolean (type casting will be applied)

Assertion XPaths are evaluated relative to the containing rule's context

moving an assertion from one rule to another will change its meaning

XPath is not good for expressing grammar rules

grammar checking should be left to grammar-oriented languages

<!ELEMENT ENTRY (NAME, ADDRESS, PHONENUM+, EMAIL) >

( count(NAME) = 1 and count(ADDRESS) = 1 and count(EMAIL) = 1 ) and ( NAME[following-sibling::ADDRESS] and ADDRESS[following-sibling::PHONENUM] and PHONENUM[following-sibling::EMAIL] ) and ( count(NAME|ADDRESS|PHONENUM|EMAIL) = count(*) )
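For comparison, the same content model is short when treated as what it is: a regular expression over the sequence of child element names. The sketch below is not a DTD validator. The function matches_entry_model and the space-separated encoding of child names are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET

# (NAME, ADDRESS, PHONENUM+, EMAIL) as a regular expression over child names,
# with each name followed by one space.
ENTRY_MODEL = re.compile(r"NAME ADDRESS (PHONENUM )+EMAIL ")

def matches_entry_model(entry):
    """Check the order and number of an ENTRY element's children."""
    child_names = "".join(child.tag + " " for child in entry)
    return ENTRY_MODEL.fullmatch(child_names) is not None

good = ET.fromstring(
    "<ENTRY><NAME/><ADDRESS/><PHONENUM/><PHONENUM/><EMAIL/></ENTRY>")
wrong_order = ET.fromstring(
    "<ENTRY><ADDRESS/><NAME/><PHONENUM/><EMAIL/></ENTRY>")
no_phone = ET.fromstring("<ENTRY><NAME/><ADDRESS/><EMAIL/></ENTRY>")

results = [matches_entry_model(e) for e in (good, wrong_order, no_phone)]
```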

Assertions with report

report is used to generate reports

if the XPath evaluates to true, then the assertion's content is output
assertions are always evaluated as boolean (type casting will be applied)

Logically, assert and report are inverse

assert is used to test conformance (it outputs errors)
report is used to report observations (it outputs messages)
Schematron's processing model is underspecified (check assertions, print outputs)

Schematron is useful for reporting to humans

machine-oriented environments need a better processing model
using Schematron as a starting point could be a good way to start

Report Example Conclusions Validation is Good

Validation is better than writing code
Validation should be seen as a pipeline process
RELAX NG can be a useful and simple substitute for XML Schema
Schematron supports XPath-oriented constraints for XML documents

XML and Database Systems Thursday, October 19, 2006 XML and Databases XML Programming with SQL/XML and XQuery; SQL/XML; XML Query XML is the most popular data format for exchanging data, but the majority of data within applications and closed systems is still stored in Relational Database Management Systems (RDBMS). This leads to two main issues, the first one being how moving data between XML formats and RDBMS can be done easily and efficiently, so that moving data between these two worlds can be done as easily as possible. The second issue is how to map the data models between these two worlds. Relational data can easily be represented in XML, because tables can be easily represented in trees. Things can be more complicated in the other direction, because arbitrary XML can be hard to store in a relational database. For XML-centric scenarios, XML Database Management Systems (XDBMS) are an interesting alternative, which provide XML-specific query capabilities with XML Query (XQuery). Abstract

XML is Trees

XML documents are trees

applications may have different internal data models (mapped to trees for interfacing)
the exchange and processing of XML documents is tree-based

Where and how is XML being used?

as a pure transfer syntax (Web Services very often are used like this)
as artifacts that have a longer lifespan (archiving of XML business documents)
as the application's data model (there is nothing but XML)

XML usage results in very different requirements for XML tools

Web Service programmers often never see the tree
archived XML documents need to be searchable
XML-centric applications need to store XML efficiently

Storing XML

XML documents are text files

they can be stored in file systems (they are self-describing)
they can be retrieved by searching through the file system

File systems are not designed to store millions of documents

standard file system implementations usually slow down dramatically
standard procedures (backup/restore/concurrency) do not work well

Problems with File Systems as XML Databases

the number of documents is too large
there is no structured access (XPath Shell (XPsh) provides an XML-find/grep)
there is no access optimization (XPsh is very slow)

XML and Databases Data needs Databases

Databases are what should be used for data storage

they provide much better querying and performance than file systems
databases are the foundation of all IT systems

Databases are designed around a set of assumptions

data must adhere to the data model of the database
databases can only work well if the data is modeled well
data modeling for databases is an essential part of an IT infrastructure

Data modeling is based on a meta model

a model can only represent what is supported by the underlying modeling language
XML is a model in itself: partially ordered trees with typed nodes (and constraints)

Model Mismatches

Simple models can be expressed in rich metamodels

features not required for the model can be ignored
the simple model can be expressed using a subset of the metamodel
e.g., DTDs can be expressed in the XML Schema metamodel

Complex models cannot be (adequately) expressed in simple metamodels

if the model is richer than the metamodel, information is lost
the unsupported parts can be piggybacked onto other mechanisms
piggybacking technically works, but it is brittle, inefficient, and a sign
e.g., XML Schema's type information cannot be expressed in DTD
parts of it can be mapped to

Relational Databases Generic XML Storage

Relational databases are the state of the art since 1976

this is long enough to build highly optimized and robust systems
this is long enough to have ER hard-wired into some brains

XML is more powerful than ER

repetitions of elements do not map well
choices do not map well
ordered content does not map well
mixed content does not map well

Storing XML in a relational database is hard

it can be done by piggybacking structural information as content
using the resulting structures is awkward and very inefficient

Tree Table

ID	Type	Name	Value	Parent	Left
1	Root
2	Element	a		1
3	Element	b		2
4	Element	c		2	3
5	Text		Text	3
6	Attribute	att	42	4
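The tree table can be loaded into SQLite and turned back into a document. This is a sketch under assumptions: the column is called left_sibling here because LEFT is an SQL keyword, and text is only handled as the first content of an element, so mixed content is out of scope. Note that rebuilding the tree takes one query per node, which illustrates why this generic storage is awkward to use.

```python
import sqlite3
import xml.etree.ElementTree as ET

ROWS = [  # (id, type, name, value, parent, left_sibling)
    (1, "Root", None, None, None, None),
    (2, "Element", "a", None, 1, None),
    (3, "Element", "b", None, 2, None),
    (4, "Element", "c", None, 2, 3),
    (5, "Text", None, "Text", 3, None),
    (6, "Attribute", "att", "42", 4, None),
]

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE node (
    id INTEGER PRIMARY KEY, type TEXT, name TEXT, value TEXT,
    parent INTEGER, left_sibling INTEGER)""")
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?, ?, ?)", ROWS)

def build(node_id):
    """Rebuild the element stored under node_id, recursing into its children."""
    (name,) = conn.execute(
        "SELECT name FROM node WHERE id = ?", (node_id,)).fetchone()
    element = ET.Element(name)
    rows = conn.execute(
        "SELECT id, type, name, value, left_sibling FROM node WHERE parent = ?",
        (node_id,)).fetchall()
    for _, kind, attr_name, value, _ in rows:
        if kind == "Attribute":
            element.set(attr_name, value)
        elif kind == "Text":
            element.text = value
    # Order the element children by following the left-sibling links.
    by_left = {left: cid for cid, kind, _, _, left in rows if kind == "Element"}
    previous = None
    while previous in by_left:
        previous = by_left[previous]
        element.append(build(previous))
    return element

# The document element is the single child of the Root row.
(doc_id,) = conn.execute("SELECT id FROM node WHERE parent = 1").fetchone()
document = build(doc_id)
```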

Database Support for XML Why XML and Databases?

XML is constantly getting more popular

XML as a document format was first used as wire format only
instead of parsing manually, parser interfaces provide better XML support
data binding frameworks bind XML even more tightly into applications
if all programs somehow hide the XML, why not work on XML directly?

What is XML for an application?

an (increasingly popular) way to represent the data?
the data itself?
currently, the representation perspective is more popular
as XML is increasingly penetrating applications, this may change

XML Interchange

XML Support in DBMS

XML DBMS

XML Storage in Databases Model Mapping

Relational databases are not good tools for storing XML

they might be appropriate if the schema disallows problematic constructs
they often are already deployed and applications must live with them

If the data model is ER-oriented, relational databases are good tools

XML may be invisible from the model point of view
parts of the model may be encoded as an XML schema

If the XML is not visible in the model, it can be structurally inaccessible

e.g., a product catalog may contain product descriptions in XHTML rich text snippets
for managing the product catalog data, the XHTML is not relevant

If the XML is part of the model, it should be accessible structurally

if the product catalog XHTML contains links to other products, these links are important
they could be extracted (creating redundant and hard to maintain data)
if they are hidden in the XHTML, all XHTML snippets have to be parsed
ideally, the database should be able to query the XHTML snippet

XML is Text

XML documents can be stored as text

databases typically have various datatypes for text storage
if the database supports Unicode, any XML document can be stored

The XML structure is completely invisible to the database

working with the XML requires querying and parsing the XML text
this kind of storage does not allow any querying of the XML content
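A sketch of what text storage means for an application, using the product catalog scenario with hypothetical table, column, and link names. The database only sees strings, so finding products that link to another product requires fetching every snippet and parsing it in the application.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, description TEXT)")
conn.executemany("INSERT INTO product VALUES (?, ?)", [
    (1, '<p>See also <a href="product/2">the deluxe model</a>.</p>'),
    (2, "<p>No related products.</p>"),
])

def products_linking_to(target):
    """Return the ids of products whose description links to target."""
    hits = []
    query = "SELECT id, description FROM product ORDER BY id"
    for product_id, text in conn.execute(query):
        snippet = ET.fromstring(text)  # the database cannot look inside the text
        if any(a.get("href") == target for a in snippet.iter("a")):
            hits.append(product_id)
    return hits

linking = products_linking_to("product/2")
```

Every lookup scans and parses the whole table, because no index can be built on structure that the database does not see.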

XML → ∗LOB

XML as a Datatype

SQL supports a wide variety of datatypes

typed values are better than untyped values (they enable type-specific operations)
XML can be regarded as just another data type

Introducing a datatype lets the database recognize the data

XML data can be stored in some format (a persistent DOM)
databases can provide functionality avoiding parsing/serialization (DOM-based)

XML Datatype

Mapping XML to Models

Model-relevant data must be mapped to the database structures

this assumes there is an ER model which describes the database structure
mapping XML is easy by definition because the XML is ER-compliant

Is the data accessed as table data?

if shredded data is only used to assemble it again, it is just performance overhead
if shredded data is accessed relationally, then shredding makes sense
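
Shredding can be sketched as follows, assuming SQLite and a hypothetical, very regular employees document. The payoff only appears in the last step, where the shredded data is queried relationally instead of being reassembled.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical data-oriented XML: every employee has the same shape.
doc = """<employees>
  <employee id="12"><first>Ada</first><last>Lovelace</last></employee>
  <employee id="13"><first>Alan</first><last>Turing</last></employee>
</employees>"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)

def shred(xml_text, connection):
    """Map each <employee> element to one row; returns the number of rows inserted."""
    root = ET.fromstring(xml_text)
    rows = [
        (int(e.get("id")), e.findtext("first"), e.findtext("last"))
        for e in root.findall("employee")
    ]
    connection.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
    return len(rows)

inserted = shred(doc, conn)
# Relational access to the shredded data: this is where shredding pays off.
result = conn.execute("SELECT last_name FROM employees WHERE emp_id > 12").fetchall()
print(inserted, result)
```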

Shredding (XML → Columns)

XML as First-Class Citizen

The XML datatype defines XML as a sub-concept of ER

the overall structure of the database is relational
attributes may be of type XML, which means storing trees in tables

Tables are not the only way to see the world

XML trees are an alternative to tables, not a datatype
XML-centric applications should not be forced to use tables at all

XML can be regarded as replacing the ER-concept altogether

the database simply stores XML documents
applications can store, query, update, and manage XML documents in the database

XML DBMS

XML in Relational Databases

RDBish XML

XML schemas can be designed with databases in mind

avoid unbounded repetitions of elements
avoid choices
avoid ordered content
avoid mixed content

Many XML schemas are designed RDBish for compatibility reasons

it was decided that the XML should enable an easy mapping to relational structures
the person designing the schema has an ER-structured brain
the schema has been generated from a relational database schema

Problematic XML

XML in its full glory is too much for tables

XML has been developed as a document format
XML is about hierarchies (which have intentionally been left out of ER)
XML is about highly irregular structures

XML is often said to have two flavors

data-oriented XML: regular data which can be easily mapped to tables
document-oriented XML: irregular structures which are hard to map to tables
real-world XML often is a bit of both (e.g., content and metadata)

Hybrid approaches sometimes are a good solution

the data-oriented parts can be shredded and stored in tables
the document-oriented rest is stored as one object (text or XML)
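
A hybrid layout might look like the following sketch, assuming SQLite and a hypothetical article format with regular metadata and a mixed-content body. The metadata is shredded into columns, while the body is kept as one serialized XML value.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document: regular metadata plus an irregular, mixed-content body.
article = (
    "<article><meta><title>Trees</title><year>2006</year></meta>"
    "<body><p>XML is <em>not</em> a table.</p></body></article>"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, year INTEGER, body TEXT)")

def store(xml_text, connection):
    root = ET.fromstring(xml_text)
    # Data-oriented part: shredded into typed columns.
    title = root.findtext("meta/title")
    year = int(root.findtext("meta/year"))
    # Document-oriented part: stored as a single object, markup intact.
    body = ET.tostring(root.find("body"), encoding="unicode")
    connection.execute("INSERT INTO articles VALUES (?, ?, ?)", (title, year, body))

store(article, conn)
row = conn.execute("SELECT title, body FROM articles WHERE year = 2006").fetchone()
print(row)
```

The metadata can be queried with plain SQL, and the body comes back with its mixed content unchanged.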

SQL/XML

SQL/XML:2003

SQL/XML provides XML-related extensions to SQL

it introduces the XML datatype
it introduces a number of operations for generating XML from query results
it defines mappings to bridge both worlds (SQL and XML)

SQL/XML does not change anything about the database model

data is still stored in tables only
a column of a table may use the XML type
queries may return results in XML rather than as SQL result sets

SQL/XML Example

SELECT
	e.EmpId,
	e.FirstName,
	e.LastName,
	e.StartDate,
	e.EndDate
FROM Employees e WHERE e.EmpId = 12

SELECT
	XMLELEMENT (NAME "employee",
		XMLATTRIBUTES(e.EmpId as "id"),
		XMLELEMENT(NAME "names",
			XMLELEMENT(NAME "first", e.FirstName),
			XMLELEMENT(NAME "last", e.LastName)),
		XMLELEMENT(NAME "hire-dates",
			XMLATTRIBUTES(e.StartDate as "start", e.EndDate as "end")))
FROM Employees e WHERE e.EmpId = 12
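
To see what the second query produces without a database that supports SQL/XML, the same nesting can be rebuilt in application code. This is only an emulation under stated assumptions: SQLite has no XMLELEMENT, the table contents are invented, and the dates are stored as plain text.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (EmpId INTEGER, FirstName TEXT, LastName TEXT, "
    "StartDate TEXT, EndDate TEXT)"
)
# Invented sample row for illustration.
conn.execute("INSERT INTO Employees VALUES (12, 'Ada', 'Lovelace', '2001-01-15', '2006-10-24')")

def employee_xml(connection, emp_id):
    """Run the plain SELECT, then build the element structure of the SQL/XML query by hand."""
    row = connection.execute(
        "SELECT e.EmpId, e.FirstName, e.LastName, e.StartDate, e.EndDate "
        "FROM Employees e WHERE e.EmpId = ?",
        (emp_id,),
    ).fetchone()
    employee = ET.Element("employee", {"id": str(row[0])})
    names = ET.SubElement(employee, "names")
    ET.SubElement(names, "first").text = row[1]
    ET.SubElement(names, "last").text = row[2]
    ET.SubElement(employee, "hire-dates", {"start": row[3], "end": row[4]})
    return ET.tostring(employee, encoding="unicode")

print(employee_xml(conn, 12))
```

With SQL/XML the construction happens inside the database, so the application receives the XML directly instead of a result set it has to convert.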

SQL/XML:2007

Adds the concept of XML Tables
XML Tables are not tables, they are containers for XML
SQL/XML:2007 changes the database's data model

it is now possible to have a database with no tables
likely use cases are to have both: traditional and XML tables

SQL/XML:2007 defines a hybrid database: relational and XML database

XML Databases

Storing XML

XML documents are text documents
Infosets are abstractions of XML documents (information loss)
XPath node trees are abstractions of Infosets (more information loss)
XML Schema adds useful information to an Infoset (type annotations)
XQuery 1.0 and XPath 2.0 Data Model (XDM)

a new creation of the W3C for XQuery and XPath (2.0)
Infoset + Types + Sequences
considerably more complicated than the XPath 1.0 node tree model
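
The information loss between the text document and its abstraction can be made visible with a small sketch. It uses canonical XML (via ElementTree's canonicalize, Python 3.8 or later) as a stand-in for the abstract view; canonical XML is not the Infoset itself, but it also discards details such as quote style and attribute order. The two sample documents are invented.

```python
import xml.etree.ElementTree as ET

# Two different text documents: quote style, attribute order, and
# whitespace inside the start tag differ.
doc_a = """<book lang='en' id="b1"><title>XML</title></book>"""
doc_b = """<book id="b1"   lang="en"><title>XML</title></book>"""

same_text = doc_a == doc_b
# After canonicalization the lexical differences are gone.
same_canonical = ET.canonicalize(doc_a) == ET.canonicalize(doc_b)

print(same_text, same_canonical)
```

A database that stores the abstraction rather than the text cannot give back the original characters, only an equivalent serialization.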

XML Query Language (XQuery)

XQuery is designed for querying XML databases

XQuery works on a set of documents
it returns results as XDM instances (sequences or XML documents)

XQuery is built on top of XPath 2.0

XPath 2.0 is a much more powerful language than XPath 1.0
XPath 2.0 is still limited to selecting parts of an XML document
XQuery provides facilities to work on multiple documents
XQuery provides facilities to construct results

XQuery is comparable to XSLT 2.0

both are built on top of XPath 2.0
both are easy to learn if you already know XPath 2.0
both can be used to process documents to yield results
XSLT is for programmers, XQuery is for SQL-users

XQuery Example

Conclusions

Tables and Trees don't Mix

Tables and trees are different data models
Different technologies are used to handle these different models
Think before choosing the wrong tool

Database Technologies do Mix

Relational databases are good tools for regular data
XML databases are good tools for document-oriented XML
SQL/XML:2007 defines a database that does both
Applications can choose the best mix of tables and trees

XML Trends & Developments

Tuesday, October 24, 2006

W3C XML Activity Statement

Abstract: XML is a very basic technology for representing trees using a standardized markup-based syntax. An increasing number of technologies are building on this foundation, creating an expanding field of XML-based technologies for interoperability in many different fields. Application-specific XML-based data formats are used in many different settings, and the best data format for a given scenario depends on the existing formats in this area and the exact requirements. More interestingly, generic XML technologies which can be applied in many different settings make it easier for developers and system integrators to achieve their goal of making systems interoperate.

Course Evaluation

15min to fill out the evaluation forms
Please take your time and make detailed comments
Your comments help to improve the course next year

Web Services

XML-Based Distributed Programming

XML exchanges often have to be negotiated in advance

the transport mechanism needs to be defined
the schema(s) need to be defined
the possible interactions between peers need to be defined

Web Services are a well-defined environment for XML exchanges
Two very different approaches to XML-based distributed programming

instead of programming-language-specific mechanisms, have your components talk to each other in XML
instead of simply wrapping APIs in XML, redesign your IT landscape into loosely coupled systems

Web Service Technologies

Simple Object Access Protocol (SOAP)

SOAP messages have an envelope whose header carries Web Service metadata
SOAP messages have a body containing the actual payload
non-XML data can be attached in the same way as for e-mail messages
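
The message structure can be sketched by building a SOAP 1.1 envelope with a header for metadata and a body for the payload. The application namespace, the getQuote operation, and the transaction header are hypothetical and only serve as an illustration.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "http://example.org/stock"  # hypothetical application namespace

def build_message(symbol, transaction_id):
    """Return a serialized SOAP message: Envelope > (Header, Body)."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    # Header: metadata about the exchange (here a made-up transaction id).
    header = ET.SubElement(envelope, f"{{{SOAP_NS}}}Header")
    ET.SubElement(header, f"{{{APP_NS}}}transaction").text = transaction_id
    # Body: the actual payload of the message.
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, f"{{{APP_NS}}}getQuote")
    ET.SubElement(request, f"{{{APP_NS}}}symbol").text = symbol
    return ET.tostring(envelope, encoding="unicode")

print(build_message("XML", "tx-1"))
```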

Web Services Description Language (WSDL) for describing SOAP-based services

the payload format must be known
the different messages that may be sent must be known
the transport mechanism must be known
the address to which the SOAP messages are sent must be known

Universal Description, Discovery, and Integration (UDDI) for making WSDL available

UDDI is intended to be a repository for WSDL descriptions
UDDI is not a global service like the DNS
UDDI is modeled after yellow pages

SOAP Example Message

WSDL Example (Google)

UDDI Data Model

XForms

HTML Forms Limitations

HTML forms are very popular for data entry

many Web-based applications use HTML forms as their interface
the features offered by HTML forms are very poor

HTML forms have a lot of limitations

they cannot check datatypes (fields are always strings)
they cannot create new fields (if data entry requires repeatable fields)
they only work in HTML (integral part of the HTML language)
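
The datatype limitation shows up as soon as a submission reaches the server. A small sketch, assuming a hypothetical form with name, age, and repeated phone fields submitted as application/x-www-form-urlencoded:

```python
from urllib.parse import parse_qs

# Hypothetical form submission as it arrives on the server.
submitted = "name=Ada&age=36&phone=555-0100&phone=555-0199"

fields = parse_qs(submitted)
# Every value is a string, including the age:
# {'name': ['Ada'], 'age': ['36'], 'phone': ['555-0100', '555-0199']}

# Any typing (and the error handling for bad input) is left to the application.
age = int(fields["age"][0])
print(fields, age)
```

The repeated phone field only works because the page already contained two inputs with that name; the form itself has no way to add a third.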

Workarounds for better Web-based applications are possible

JavaScript can be used to provide additional functionality
server-side engines can provide a back-end for better forms
writing accessible, portable, and usable forms is a challenge

XForms

XML is the most ubiquitous data format on the Web

there is no generally supported way to edit or produce XML data
forms should be XML-based rather than being based on HTTP/MIME

XForms define an XML-based model for data editing and input

they separate content from presentation
clients are free in their choice of data presentation and acquisition
XForms provide an XML-in, XML-out model of data handling
XForms can be implemented server- or client-based

Client-based XForms require browser support
Server-based XForms require XForms↔DHTML mappings

XForms Limitations

XForms are good for data-oriented XML

regularly structured data
no mixed content

XForms are inappropriate for document-oriented XML

irregularly structured data is not well-supported
mixed content is not supported at all

XForms is for forms, it is not a general XML editing facility

XML editors often need a lot of customization
there is no standards-based way for general-purpose XML editing

XPath 2.0

XML Databases

XML databases are a very active field

XML users want to store their XML in something better than a file system
there must be a way to retrieve XML from this storage
an XML-specific query language would be the ideal tool

XPath 2.0 is a good foundation for a query language

XPath 1.0 is too simple to be useful (and has no type support)
XPath 2.0 is a much bigger language (many more functions)
XPath 2.0 is a more powerful language (more expressive power)

XPath 2.0 is the new foundation for XML technologies

XSLT 2.0 is an improved version of XSLT 1.0
XQuery is a query language for XML
both languages are built on top of XPath 2.0

XSLT 2.0

XSLT Improvements

XSLT 2.0 has grown by around 40%
Some of the most limiting aspects have been removed

grouping is now part of the language (iterate over groups of nodes)
stylesheets can now produce more than one result document
stylesheets can now read (and tokenize) text files

XSLT 2.0 can be used in conjunction with XML Schema

the input document is validated and type-annotated
the result document is validated while being constructed

<!-- tokenize the input file by lines and output them as newline-separated list of strings. -->
<xsl:variable name="listing" select="string-join(tokenize(unparsed-text($fileuri, 'UTF-8'), '\r?\n'), '&#xa;')"/>
<xsl:value-of select="if (@tab eq 'retain') then $listing else replace($listing, '\t', ' ')"/>
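
For readers without an XSLT 2.0 processor at hand, the following Python sketch mirrors what the two instructions above compute: split the text at line ends, rejoin with newlines, and optionally replace tabs. It works on a string instead of reading a file, and the function name and the tab parameter are assumptions modeled on the @tab attribute.

```python
import re

def listing(text, tab="retain"):
    """Normalize line ends to newlines; replace each tab with a space unless tab == 'retain'."""
    # Same regular expression as in tokenize(): optional CR followed by LF.
    lines = re.split(r"\r?\n", text)
    joined = "\n".join(lines)
    return joined if tab == "retain" else joined.replace("\t", " ")

print(repr(listing("a\r\nb\tc\n", tab="expand")))
```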

XPath vs. XSLT

XPath 2.0 has grown by around 70%
XPath 2.0 has many more features than XPath 1.0
More problems can be solved in XPath directly
XSLT programming is more powerful and more challenging

more ways to solve the same problem
favoring XPath over XSLT is a matter of style (and robustness and maintainability)

<xsl:value-of select="if ( @gender = 'male' ) then 'Sir' else 'Madam'"/>

<xsl:choose>
  <xsl:when test="@lang = 'en'">English</xsl:when>
  <xsl:when test="@lang = 'de'">Deutsch</xsl:when>
  <xsl:when test="@lang = 'fr'">Français</xsl:when>
  <xsl:otherwise>n/a</xsl:otherwise>
</xsl:choose>

XQuery

XML Query Language (XQuery)

XQuery is designed for querying XML databases

XQuery works on a set of documents
it returns results as XDM instances (sequences or XML documents)

XQuery is built on top of XPath 2.0

XPath 2.0 is a much more powerful language than XPath 1.0
XPath 2.0 is still limited to selecting parts of an XML document
XQuery provides facilities to work on multiple documents
XQuery provides facilities to construct results

XQuery is comparable to XSLT 2.0

both are built on top of XPath 2.0
both are easy to learn if you already know XPath 2.0
both can be used to process documents to yield results
XSLT is for programmers, XQuery is for SQL-users

XQuery Example

Semantic Web

XML is Syntax

XML facilitates the exchange of trees
XML schema languages define constraints for trees
The meaning of the data encoded in the tree is unclear

XML has no semantics (with the exception of xml:lang)
semantics have to be agreed upon before cooperation is possible
XML relies on other mechanisms (documentation, formal models)

Semantics

Semantics can be defined in ontologies

ontologies are a formalization of a conceptualization

By referring to ontologies, cooperation can use shared semantics

of course, this only works if people first agree on the ontology
domain specialists build ontologies, which are then used for semantics

Semantic Web technologies revolve around the idea of ontologies

Resource Description Framework (RDF) annotations describe resources semantically
the ontology is defined using the Web Ontology Language (OWL)
all kinds of AI-style applications are possible using formalized semantics
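
A toy sketch of how formalized semantics enable simple inference. It models RDF-style statements as (subject, predicate, object) tuples and follows subclass relations; the ex: resources are hypothetical, and this is neither an RDF parser nor an OWL reasoner.

```python
# Hypothetical statements; ex: names are invented for illustration.
triples = {
    ("ex:alice", "rdf:type", "ex:Lecturer"),
    ("ex:Lecturer", "rdfs:subClassOf", "ex:Person"),
    ("ex:Person", "rdfs:subClassOf", "ex:Agent"),
}

def types_of(resource, graph):
    """Asserted types of a resource plus all superclasses reachable via rdfs:subClassOf."""
    found = {o for s, p, o in graph if s == resource and p == "rdf:type"}
    frontier = list(found)
    while frontier:
        cls = frontier.pop()
        for s, p, o in graph:
            if s == cls and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                frontier.append(o)
    return found

print(sorted(types_of("ex:alice", triples)))
```

Only the first statement says anything about ex:alice directly; the other two types follow from the shared class hierarchy, which is the kind of agreement an ontology provides.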

Conclusions

XML is Growing

XML is the foundation for structured information
XML is getting closer to programming languages
XML is becoming the standard toolset for any kind of structured information
XML itself is simple, but using XML wisely is not always simple
Schemas and documents may live very long, so plan ahead and choose wisely

Discussion

Syntax vs. Model (i.e., brackets vs. Infoset)
XSLT before XML Schema
Schema languages (XML Schema vs. alternatives)
XSLT 2.0 and XPath 2.0 (and XQuery)
Course length vs. content (full-semester course?)