Abstract

While XPath 2.0 is syntactically an extension of XPath 1.0, the underlying data model has changed quite radically. Instead of XPath 1.0's simple concept of four datatypes (node set, number, string, boolean), the XQuery 1.0 and XPath 2.0 Data Model (XDM) is based on sequences and allows much more sophisticated ways of representing and manipulating data. Furthermore, XDM includes the datatypes defined by XSDL, which results in a complex and powerful collection of built-in datatypes and operations on these datatypes.

Outline (Sets vs. Sequences)

Sets vs. Sequences
Comparisons
Available Datatypes
Working with Sequences
Working with Values
Conclusions

XPath 1.0 Sets

XPath 1.0 has a very simple data model of four types
1. node sets: //img[not(@alt)]
2. number: count(//img)
3. string: /descendant::img[3]/@src
4. boolean: starts-with(/html/@lang, 'en')
When XPath 1.0 was created, the XML world was untyped
- XML documents contain content in text nodes and attribute values
- XPath introduced its humble world of three atomic datatypes (number, string, boolean) alongside node sets
Dealing with types in XSLT 1.0 is very unpleasant
- all datatypes beyond the basic types must be implemented by hand
- all operations on these types must be implemented as well
- EXSLT collects modules for frequently used datatypes

XPath 2.0 Sequences

XSDL introduces the concept of typed data to the XML world
- one part of XSDL is its ability to validate documents
- the other part of XSDL is the fact that validation produces type annotations
Sequences are the mechanism through which these types show up in XPath
XPath 2.0 needs a more powerful model for its advanced functionality
- everything in XPath 2.0 is a sequence (of typed items)
- sequences can contain a mix of items of various types
- sequences cannot be nested (there are no sequences of sequences)
Sequences replace node sets, which in XDM do not exist anymore
- Sequences replace node-sets from XPath 1.0. In XPath 1.0, node-sets do not contain duplicates. In generalizing node-sets to sequences in XPath 2.0, duplicate removal is provided by functions on node sequences. (XDM)
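
The rule that sequences cannot be nested can be modeled outside XPath. The following Python sketch is only an illustration of the semantics, not XPath code; the helper name `seq` is hypothetical and sequences are modeled as tuples.

```python
def seq(*items):
    """Build a flat sequence (modeled as a tuple).

    Nested sequences are spliced into the result, so a sequence
    never contains another sequence as an item.
    """
    flat = []
    for item in items:
        if isinstance(item, tuple):
            flat.extend(seq(*item))   # splice the nested sequence in place
        else:
            flat.append(item)         # atomic value (or node) stays an item
    return tuple(flat)


# Mixed item types are fine; nesting and empty sequences disappear.
print(seq((1, 2), (), ("a", (3.5, True))))   # (1, 2, 'a', 3.5, True)
```

In XPath 2.0 the same flattening happens when writing ((1, 2), (), ("a", (3.5, true()))).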

Outline (Comparisons)

General Comparisons

= != < <= > >=

XPath 1.0 only has these operators
- they are defined to work on any of the four datatypes
- node set comparisons are defined in a rather complex way
- in particular, XPath 1.0 comparisons often involve type casting
XPath 2.0 introduces Value Comparisons for comparing atomic values
- they were introduced to provide a set of operators with fewer surprises
- the original XPath 1.0 operators are redefined to work on sequences
General comparisons can be expressed using Quantified Expressions
- potentially a large number of comparisons
```
$X = $Y
```
```
some $x in $X, $y in $Y satisfies $x eq $y
```
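
The existential ("some ... satisfies") reading of general comparisons can be sketched in Python. This is a model of the semantics under simplifying assumptions (no atomization, no type conversion); the function name `general_compare` is hypothetical.

```python
import operator
from itertools import product


def general_compare(xs, ys, op=operator.eq):
    """True if at least one pair (x, y) satisfies op.

    Mirrors: some $x in $X, $y in $Y satisfies $x op $y
    In the worst case this performs len(xs) * len(ys) comparisons.
    """
    return any(op(x, y) for x, y in product(xs, ys))


print(general_compare((1, 2), (2, 3)))               # True  (2 eq 2)
print(general_compare((1, 2), (3, 4)))               # False (no equal pair)
print(general_compare((1, 2), (2, 3), operator.lt))  # True  (1 lt 2)
```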

Value Comparisons

eq ne lt le gt ge

These operators have been introduced by XPath 2.0
- they work on single values only
- they should be preferred whenever both operands are single values; general comparisons are only needed when an operand may be a sequence
The value comparison operators also have built-in type conversion rules
- prior to anything else, both operands are atomized
- comparing with an empty sequence always yields an empty sequence
- comparing with a sequence with more than one item yields an error
- after that, the values are converted to a common type
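
The cardinality rules listed above can be sketched as follows. This Python model assumes the operands are already atomized and of comparable types; the name `value_eq` is hypothetical, the empty sequence is modeled as an empty tuple, and the XPath type error is modeled as a `TypeError`.

```python
def value_eq(left, right):
    """Model of the 'eq' operator on atomized operands given as tuples."""
    if len(left) == 0 or len(right) == 0:
        return ()          # an empty operand yields the empty sequence
    if len(left) > 1 or len(right) > 1:
        # more than one item is an error for value comparisons
        raise TypeError("eq requires operands of at most one item")
    return left[0] == right[0]


print(value_eq((1,), (1,)))   # True
print(value_eq((), (1,)))     # ()
try:
    value_eq((1, 2), (1,))
except TypeError as err:
    print("error:", err)
```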

Node Comparisons

is << >>

Comparing nodes by identity or document order
- node identity is very cumbersome to test in XPath 1.0
- full axis support is optional in XQuery 1.0 (the Full Axis Feature)
- some XQuery implementations do not support preceding* and following*
$a is $b is true only if both variables identify the same node
- when processing documents, identity often is more relevant than equality
- much better than XPath 1.0's generate-id($a) = generate-id($b)
$a << $b is true if $a precedes $b in document order
- unlike the preceding axis, which excludes ancestors, << uses plain document order, so an ancestor precedes its descendants
- if the nodes are in different documents, the order is implementation-dependent (but stable)
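
Identity and document order can be illustrated with Python's standard `xml.etree.ElementTree`. This is only an analogy to `is` and `<<`, restricted to element nodes of a single document; the sample document and the helper `precedes` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the slides.
root = ET.fromstring("<doc><a><b/></a><c/><c/></doc>")

# Position of every element in document order (an element comes before its content).
position = {node: pos for pos, node in enumerate(root.iter())}


def precedes(x, y):
    """Model of $x << $y for elements of the same document."""
    return position[x] < position[y]


a = root.find("a")
b = root.find("a/b")
c1, c2 = root.findall("c")

print(a is root.find("a"))   # True:  both expressions identify the same node
print(c1 is c2)              # False: equal content, but two different nodes
print(precedes(a, b))        # True:  an ancestor precedes its descendants
print(precedes(c2, b))       # False
```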

Some Surprises

Sequences make some things more complicated than atomic values
$X = $X is not always true
- if $X is the empty sequence, there are no equal items
$X != 'test' and not($X = 'test') are not the same
- $X != 'test' is true if one item in $X is not equal to 'test'
- not($X = 'test') is true if no item in $X is equal to 'test'
- the classic case is an optional attribute: @mode != 'test' is false if there is no @mode!
- it is generally a good idea to avoid != and use not() and =
$X = $Y and $Y = $Z does not imply $X = $Z
- the reason is that comparisons are done pairwise (each general comparison is a set of item comparisons)
- with $X = (1, 2), $Y = (2, 3), and $Z = (3, 4), both $X = $Y and $Y = $Z are true, but $X = $Z is false
- = only tests for partial equality (one pair of items must be equal)
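
These surprises follow directly from the existential semantics and can be reproduced with a small Python model. The helpers `geq` and `gne` are hypothetical stand-ins for the general = and != operators; sequences are modeled as tuples, and a missing attribute as the empty tuple.

```python
from itertools import product


def geq(xs, ys):
    """General '=': some pair of items is equal."""
    return any(x == y for x, y in product(xs, ys))


def gne(xs, ys):
    """General '!=': some pair of items is unequal."""
    return any(x != y for x, y in product(xs, ys))


empty = ()
print(geq(empty, empty))             # False: $X = $X fails for the empty sequence

mode = ()                            # models a missing @mode attribute
print(gne(mode, ("test",)))          # False: @mode != 'test'
print(not geq(mode, ("test",)))      # True:  not(@mode = 'test')

X, Y, Z = (1, 2), (2, 3), (3, 4)
print(geq(X, Y), geq(Y, Z), geq(X, Z))   # True True False: not transitive
```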

Outline (Available Datatypes)

XSDL Everywhere

XSDL is the foundation for many XML technologies today
- it is a complex standard that few people really understand
- nevertheless, the W3C hardwires it into many new specifications
XSDL has two parts
- Part 1 (Structures) defines how schemas describe the structure of XML documents
- Part 2 (Datatypes) defines an extensible datatype library based on a set of built-in datatypes
XSDL does two things
- it defines how documents are validated against a schema by inspecting the document
- it defines how documents are annotated with the results of the validation process
XDM is based on annotated documents
- documents without annotations are just a special case (everything is untyped)
- XDM has been built around the assumption that people use schemas

Simplified XDM Type Hierarchy

Atomic Types

XSDL simple types and XPath's atomic types are similar
- simple types can be atomic, union, or list types
- XPath only considers XSDL's atomic types as its own atomic types
XPath adds four more types to XSDL's simple types
- dayTimeDuration and yearMonthDuration for better duration handling
- anyAtomicType and untypedAtomic for the type hierarchy
Type-based processing is only available with schema support
- XSLT 2.0 distinguishes basic and schema-aware XSLT processors
- XQuery 1.0 distinguishes minimal, schema importing, and schema validating XQuery processors

Outline (Working with Sequences)

Testing Sequence Cardinality

Testing for empty sequences
```
empty(()) = true()
```
Testing for non-empty sequences
```
exists((1, 2, 3)) = true()
```
Cleaner code for conditional expressions
- good code should not rely on implicit type conversions
```
if ( exists(@email) ) then …
```
```
if ( empty(@email) ) then …
```

Set Operations on Sequences

Merging two node sets (no duplicates, document order)
```
() | ()
```
Intersecting two node sets (no duplicates, document order)
```
() intersect ()
```
Subtracting two node sets (no duplicates, document order)
```
() except ()
```
Comparing sequences item by item for deep equality
```
deep-equal((1, 2, 3), (1, 3, 2)) = false()
```
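
The "no duplicates, document order" behavior of the node set operators can be modeled in Python with `xml.etree.ElementTree`. This is a sketch under simplifying assumptions (element nodes of one document only); the sample document and all helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with four sibling elements.
root = ET.fromstring("<r><a/><b/><c/><d/></r>")
a, b, c, d = list(root)
position = {node: pos for pos, node in enumerate(root.iter())}


def in_document_order(nodes):
    """Remove duplicate nodes (by identity) and sort into document order."""
    return sorted(set(nodes), key=position.__getitem__)


def union(xs, ys):          # models $xs | $ys
    return in_document_order(list(xs) + list(ys))


def intersect(xs, ys):      # models $xs intersect $ys
    keep = set(ys)
    return in_document_order(n for n in xs if n in keep)


def except_(xs, ys):        # models $xs except $ys
    drop = set(ys)
    return in_document_order(n for n in xs if n not in drop)


def tags(nodes):
    return [n.tag for n in nodes]


print(tags(union([c, a], [b, a])))          # ['a', 'b', 'c']
print(tags(intersect([c, a, b], [b, c, d])))  # ['b', 'c']
print(tags(except_([c, a, b], [b])))        # ['a', 'c']
print((1, 2, 3) == (1, 3, 2))               # False: item-by-item comparison is order-sensitive
```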

Manipulating Sequences (I)

Concatenating sequences

((1, 2, 3), (4, 5, 6)) = (1, 2, 3, 4, 5, 6)

Reversing sequences
```
reverse((1, 2, 3, 4)) = (4, 3, 2, 1)
```
Finding items in sequences
```
index-of((1, 2, 3, 1), 1) = (1, 4)
```

Cutting sub-sequences out of sequences

subsequence((1, 2, 3, 4, 5, 6, 7), 5, 2) = (5, 6)
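
Positions in XPath sequences are 1-based, which is easy to get wrong when coming from 0-based languages. The following Python sketch models the three functions above on tuples; it handles integer arguments only (fn:subsequence additionally rounds non-integer arguments), and the function names are hypothetical mirrors of the XPath ones.

```python
def reverse(seq):
    return tuple(reversed(seq))


def index_of(seq, value):
    """All 1-based positions at which value occurs."""
    return tuple(i for i, item in enumerate(seq, start=1) if item == value)


def subsequence(seq, start, length=None):
    """Items at 1-based positions p with start <= p < start + length."""
    begin = max(start, 1)
    end = len(seq) + 1 if length is None else start + length   # exclusive
    return tuple(seq[begin - 1:max(end - 1, begin - 1)])


print(reverse((1, 2, 3, 4)))                    # (4, 3, 2, 1)
print(index_of((1, 2, 3, 1), 1))                # (1, 4)
print(subsequence((1, 2, 3, 4, 5, 6, 7), 5, 2))  # (5, 6)
print(subsequence((1, 2, 3, 4, 5, 6, 7), 0, 3))  # (1, 2): position 0 does not exist
```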

Manipulating Sequences (II)

Inserting items into sequences

insert-before(("one", "two", "four"), 3, "three") = ("one", "two", "three", "four")

Removing items from sequences

remove(("white", "white", "black", "white"), 3) = ("white", "white", "white")

Removing duplicates from a sequence

distinct-values((1, 2, 3, 1, 2, 6, 7)) = (1, 2, 3, 6, 7)

Help your optimizer!

unordered((1, 2, 3, 4, 5)) = (3, 4, 1, 2, 5)
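
The insertion, removal, and duplicate elimination examples can be modeled on Python tuples as follows. The function names are hypothetical mirrors of the XPath functions. One assumption to note: this `distinct_values` keeps the first occurrence of each value in its original order, whereas XPath leaves the order of the result implementation-dependent (as it does for unordered()).

```python
def insert_before(seq, position, inserts):
    """Insert the items of 'inserts' before the 1-based position."""
    position = min(max(position, 1), len(seq) + 1)   # clamp out-of-range positions
    return seq[:position - 1] + tuple(inserts) + seq[position - 1:]


def remove(seq, position):
    """Drop the item at the 1-based position; out-of-range positions change nothing."""
    if position < 1 or position > len(seq):
        return seq
    return seq[:position - 1] + seq[position:]


def distinct_values(seq):
    """Remove duplicates, keeping first occurrences (dicts preserve insertion order)."""
    return tuple(dict.fromkeys(seq))


print(insert_before(("one", "two", "four"), 3, ("three",)))
# ('one', 'two', 'three', 'four')
print(remove(("white", "white", "black", "white"), 3))
# ('white', 'white', 'white')
print(distinct_values((1, 2, 3, 1, 2, 6, 7)))
# (1, 2, 3, 6, 7)
```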

Aggregating Sequences

Counting the number of items in a sequence
```
count((1, 2, 3, 4, 5, 6)) = 6
```
Calculating the average (the types must be compatible)
```
avg((1, 2, 3, 4, 5, 6)) = 3.5
```
Getting maximum or minimum values from a sequence (the types must be compatible)
```
max($seq) ge min($seq)
```
Calculating the sum of sequence items (the types must be compatible)
```
sum(1 to 42) = 903
```
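
For comparison, the same aggregations in Python, including what "the types must be compatible" means in practice: mixing numbers and strings is an error rather than a silent conversion. This is an analogy only; the helper `safe_max` is hypothetical and Python's `TypeError` stands in for the XPath type error.

```python
from statistics import mean

values = (1, 2, 3, 4, 5, 6)

print(len(values))                  # 6      like count()
print(mean(values))                 # 3.5    like avg()
print(max(values) >= min(values))   # True   like max($seq) ge min($seq)
print(sum(range(1, 43)))            # 903    like sum(1 to 42)


def safe_max(seq):
    """Return max(seq), or None if the items cannot be compared with each other."""
    try:
        return max(seq)
    except TypeError:
        return None


print(safe_max((1, "two", 3)))      # None: incompatible item types
```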

Outline (Working with Values)

Type Casting

Values often are untyped
- they may be part of a schema-less document
- they may be extracted as a substring of some text value
- XSLT 2.0 allows reading text files (these texts are never typed)
For intermediate results, typed values may be advantageous
- certain operations are only possible on typed values
- code using typed values usually is more robust

XPath 2.0 has several ways to handle types and instances

42 instance of xs:integer

'2007-02-13' castable as xs:date

'2007-02-13' cast as xs:date

if ( $i castable as xs:… ) then $i cast as xs:… else ()
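
The "test with castable as, then cast as, else return the empty sequence" pattern has a close analogue in most languages. The following Python sketch uses `datetime.date` as a stand-in for xs:date; the function names are hypothetical, `None` models the empty sequence, and the set of strings accepted by `date.fromisoformat` is not identical to the xs:date lexical space (for example, xs:date permits a timezone suffix).

```python
from datetime import date


def castable_as_date(value):
    """Rough model of: $value castable as xs:date"""
    try:
        date.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


def cast_as_date_or_empty(value):
    """Rough model of: if ($v castable as xs:date) then $v cast as xs:date else ()"""
    return date.fromisoformat(value) if castable_as_date(value) else None


print(isinstance(42, int))                    # True, like: 42 instance of xs:integer
print(castable_as_date("2007-02-13"))         # True
print(cast_as_date_or_empty("2007-02-13"))    # 2007-02-13
print(cast_as_date_or_empty("2007-02-30"))    # None: not a valid date
```

Working with the typed value afterwards (for example comparing two dates) is more robust than comparing the original strings.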

Outline (Conclusions)

Advanced Selections

XPath 2.0 is a powerful language for selection in XML
XDM provides the sequence model as a foundation
Functions and operators allow advanced sequence handling
XPath 2.0 takes some time to get used to
Problems can be solved in a variety of ways

Sample XML

 <reference name="kau03">
  <names type="author">
   <person>
    <givenname>Roland</givenname>
    <surname>Kaufmann</surname>
   </person>
  </names>
  <date value="2003-05"/>
  <abstract>
   <richtext>
    <p>The work presented here extends an existing algorithm for testing if an inclusion relation exists between two markup schemata, to only take into account the parts of the grammar that have been used in a given subset of its language. Statistics for this purpose are gathered in combination with validation when documents are entered and are stored along with them in the repository. This modified subtyping relation is used to determine compatibility with the current database when a schema is upgraded.</p>
   </richtext>
  </abstract>
  <address>Bergen, Norway</address>
  <publisher type="university">University of Bergen</publisher>
  <title>Efficiently Locating Schema Incompatibilities in an Extensible Markup Language</title>
  <annotation>
   <richtext>
    <p>Keywords: <keywordref type="topic-xmlschema" weight="0.8"/>; </p>
   </richtext>
  </annotation>
  <identifier type="uri" resourceType="application/pdf">www.ub.uib.no/elpub/2003/h/413001/Hovedoppgave.pdf</identifier>
  <howpublished>Ph.D. Thesis</howpublished>
 </reference>

Questions

Find all publications published in 2004 (expected result: 58)
- date/@value can be YYYY[-MM[-DD]]
Find the last names of all XML-oriented authors (expected result: 48)
- keywordref/@type must be set to topic-xml (ignore the @weight)
- authors should only be counted once (name clashes are out of scope for this question)
Find all references where at least two authors have the same given name (expected result: 8)
- descendant::givenname is a safe way to find all given names for a reference
- given names cannot be repeated (the schema does not allow repetition)
Which publications dated 2000 or later have been updated? (expected result: 48)
- xref[@type eq 'updates'] points to updated references (to their @name)
- strings can be compared with Value Comparisons
What is the average number of authors per publication? (expected result: 1.795138)
- reference/names can have name or person children; count both

XQuery 1.0 and XPath 2.0 Data Model (XDM)

XML Foundations (INFO 242)

Erik Wilde, UC Berkeley School of Information
2007-10-11
