XML Path Language (XPath) 2.0

XML Foundations
Fall 2011 — INFO 242 (CCN 42596)

Erik Wilde, UC Berkeley School of Information
2011-10-06

This work is licensed under a CC Attribution 3.0 Unported License [http://creativecommons.org/licenses/by/3.0/]

(2) Abstract

The XML Path Language (XPath) is one of the most useful and frequently used languages in the area of XML technologies. In its version 1.0, it is used in technologies such as XSLT, XSD, DOM, and XML Tools. With XPath 2.0, the language has been greatly extended; the new version is the foundation for XSLT 2.0 and XQuery. XPath 2.0 supports regular expression matching and typed expressions, and it adds language constructs for conditional and repeated evaluation.



XQuery 1.0 and XPath 2.0 Data Model (XDM)

Outline (XQuery 1.0 and XPath 2.0 Data Model (XDM))

  1. XQuery 1.0 and XPath 2.0 Data Model (XDM)
    1. Sets vs. Sequences
    2. Comparisons
    3. Working with Sequences
    4. Working with Values
  2. Readable Syntax
    1. Conditional Expressions
    2. Iterations
    3. Quantified Expressions
  3. Conclusions

Sets vs. Sequences

(5) XPath 1.0 Sets

  • XPath 1.0 has a very simple data model of four types
    1. node sets: //img[not(@alt)]
    2. number: count(//img)
    3. string: string(/descendant::img[3]/@src)
    4. boolean: starts-with(/html/@lang, 'en')
  • When XPath 1.0 was created, the XML world was untyped
    • XML documents contain content in text nodes and attribute values
    • XPath introduced its humble world of three atomic datatypes (number, string, boolean) next to node sets
  • Dealing with types in XSLT 1.0 is very unpleasant
    • all datatypes beyond the basic types must be implemented by hand
    • all operations on these types must be implemented as well
    • EXSLT [http://www.exslt.org/] collects modules for frequently used datatypes


(6) XPath 2.0 Sequences

  • XSD introduces the concept of typed data to the XML world
    • one part of XSD is its ability to validate documents
    • the other part of XSD is the fact that validation produces type annotations
  • Sequences are the mechanism through which these types show up in XPath
  • XPath 2.0 needs a more powerful model for its advanced functionality
    • everything in XPath 2.0 is a sequence (of typed items)
    • sequences can contain a mix of items of various types
    • sequences cannot be nested (there are no sequences of sequences)
  • Sequences replace node sets, which no longer exist in XDM
    • "Sequences replace node-sets from XPath 1.0. In XPath 1.0, node-sets do not contain duplicates. In generalizing node-sets to sequences in XPath 2.0, duplicate removal is provided by functions on node sequences." (XDM [http://www.w3.org/TR/xpath-datamodel/#sequences])


Comparisons

(8) General Comparisons

= != < <= > >=
  • XPath 1.0 only has these operators
    • they are defined to work on any of the four datatypes
    • node set comparisons are defined in a rather complex way [http://www.w3.org/TR/xpath#booleans]
    • in particular, XPath 1.0 comparisons often involve type casting
  • XPath 2.0 introduces Value Comparisons for comparing atomic values
    • they are introduced to provide a set of operators with fewer surprises
    • the original XPath 1.0 operators are redefined to work on sequences
  • General comparisons can be expressed using Quantified Expressions
    • potentially a large number of comparisons, one for each pair of items
      $X = $Y
      some $x in $X, $y in $Y satisfies $x eq $y


(9) Value Comparisons

eq ne lt le gt ge
  • These operators have been introduced by XPath 2.0
    • they work on single values only
    • they should be preferred unless the operands may legitimately be sequences
  • The value comparison operators also have built-in type conversion rules
    • prior to anything else, both operands are atomized
    • comparing with an empty sequence always yields an empty sequence
    • comparing with a sequence with more than one item yields an error
    • after that, the values are converted to a common type
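
The conversion rules above can be modelled outside XPath. The following is a minimal Python sketch, not a conforming implementation: sequences are represented as Python lists of already atomized values, the function name value_eq is hypothetical, and conversion to a common type is left to Python's own numeric promotion.

```python
def value_eq(left, right):
    """Rough model of the XPath 2.0 `eq` operator on atomized sequences."""
    # An empty operand makes the whole comparison the empty sequence.
    if len(left) == 0 or len(right) == 0:
        return []
    # More than one item on either side is an error.
    if len(left) > 1 or len(right) > 1:
        raise TypeError("value comparison requires single items")
    # Assumption: Python's numeric promotion stands in for the
    # "convert to a common type" step (for example 1 == 1.0).
    return [left[0] == right[0]]
```

Under these assumptions, value_eq([1], [1.0]) yields a one-item sequence holding true, while value_eq([], [1]) yields the empty sequence rather than false.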


(10) Node Comparisons

is << >>
  • Comparing nodes by identity or document order
    • node identity is very cumbersome to test in XPath 1.0
    • XQuery 1.0 makes support for some axes optional (the Full Axis Feature)
    • some XQuery implementations do not support preceding* and following*
  • $a is $b is true only if both variables identify the same node
    • when processing documents, identity often is more relevant than equality
    • much better than XPath 1.0's generate-id($a) = generate-id($b)
  • $a << $b is true if $a precedes $b in document order
    • document order places an element before its descendants, so << (unlike the preceding axis) does not exclude containment
    • if the nodes are in different documents, the order is implementation-dependent (but stable)
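
To make the difference between identity and equality concrete, here is a small Python sketch using the standard xml.etree.ElementTree module. It is an illustration only: the sample document and the helper name precedes are assumptions, and document order is modelled as the position in a depth-first, pre-order walk of a single tree.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<names><name><last>Wilde</last></name>"
    "<name><last>Wilde</last></name></names>"
)
first_last, second_last = doc.findall(".//last")

# Equality compares values, identity compares nodes.
same_value = first_last.text == second_last.text  # like $a = $b
same_node = first_last is second_last             # like $a is $b

# Document order: position in a depth-first, pre-order walk of the tree.
position = {node: i for i, node in enumerate(doc.iter())}

def precedes(a, b):
    """Model of $a << $b for two nodes of the same document."""
    return position[a] < position[b]
```

Here same_value is true while same_node is false, because the two last elements carry the same text but are distinct nodes.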


(11) Some Surprises

  • Sequences make some things more complicated than atomic values
  • $X = $X is not always true
    • if $X is the empty sequence, there are no equal items
  • $X != 'test' and not($X = 'test') are not the same
    • $X != 'test' is true if one item in $X is not equal to 'test'
    • not($X = 'test') is true if no item in $X is equal to 'test'
    • the classic case is optional content: @mode != 'test' is false if there is no @mode!
    • it is generally a good idea to avoid != and use not() and =
  • $X = $Y and $Y = $Z does not imply $X = $Z
    • the reason is that comparisons are done pairwise (each general comparison is a set of item comparisons)
    • (1, 2), (2, 3), and (3, 4) illustrate this behavior
    • = only tests for partial equality (one pair of items must be equal)
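
These surprises follow directly from the pairwise definition. The Python sketch below models general comparisons on lists of atomic values; the function names general_eq and general_ne are hypothetical, and atomization and type conversion are ignored.

```python
from itertools import product

def general_eq(xs, ys):
    """$X = $Y: true if at least one pair of items is equal."""
    return any(x == y for x, y in product(xs, ys))

def general_ne(xs, ys):
    """$X != $Y: true if at least one pair of items differs."""
    return any(x != y for x, y in product(xs, ys))

mode = []  # models a missing @mode attribute
ne_test = general_ne(mode, ["test"])          # False: there is no pair to compare
not_eq_test = not general_eq(mode, ["test"])  # True: no item equals 'test'

# = is not transitive on sequences.
x_eq_y = general_eq([1, 2], [2, 3])  # True (shared item 2)
y_eq_z = general_eq([2, 3], [3, 4])  # True (shared item 3)
x_eq_z = general_eq([1, 2], [3, 4])  # False (no shared item)
```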


Working with Sequences

(13) Testing Sequence Cardinality

  • Testing for empty sequences
    empty(()) = true()
  • Testing for non-empty sequences
    exists((1, 2, 3)) = true()
  • Cleaner code for conditional expressions
    • good code should not rely on implicit type conversions
      if ( exists(@email) ) then …
      if ( empty(@email) ) then …
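
One reason to avoid implicit conversions is that a sequence's effective boolean value and its existence can disagree for atomic values. The following Python sketch is a simplified model of that difference: it only handles lists of atomic values (node sequences, which are true whenever they are non-empty, are left out), and the function names are illustrative.

```python
import math

def exists(seq):
    """Model of exists(): true for any non-empty sequence."""
    return len(seq) > 0

def effective_boolean_value(seq):
    """Simplified model of the implicit boolean conversion for atomic values."""
    if len(seq) == 0:
        return False
    if len(seq) > 1:
        raise TypeError("no effective boolean value for several atomic items")
    item = seq[0]
    if isinstance(item, bool):
        return item
    if isinstance(item, str):
        return len(item) > 0
    if isinstance(item, (int, float)):
        is_nan = isinstance(item, float) and math.isnan(item)
        return item != 0 and not is_nan
    raise TypeError("unsupported item type in this sketch")
```

In this model an empty string or the number 0 exists but converts to false, which is exactly the kind of case where exists() and empty() state the intent more clearly.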


(14) Set Operations on Sequences

  • Merging two node sets (no duplicates, document order)
    () | ()
  • Intersecting two node sets (no duplicates, document order)
    () intersect ()
  • Subtracting two node sets (no duplicates, document order)
    () except ()
  • Comparing sequences item by item for deep equality
    deep-equal((1, 2, 3), (1, 3, 2)) ≡ false()
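
The phrase "no duplicates, document order" can be illustrated with a short Python sketch on ElementTree nodes. The sample document and all function names are assumptions made for this illustration; duplicates are detected by node identity, and deep_equal is modelled for atomic values only.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<r><a/><b/><c/></r>")
a, b, c = list(doc)
order = {node: i for i, node in enumerate(doc.iter())}

def in_document_order(nodes):
    """Remove duplicate nodes (by identity) and sort by document order."""
    return sorted(set(nodes), key=order.__getitem__)

def union(xs, ys):
    return in_document_order(xs + ys)

def intersect(xs, ys):
    return in_document_order([x for x in xs if any(x is y for y in ys)])

def except_(xs, ys):
    return in_document_order([x for x in xs if not any(x is y for y in ys)])

def deep_equal(xs, ys):
    """Order-sensitive, item-by-item comparison of atomic sequences."""
    return len(xs) == len(ys) and all(x == y for x, y in zip(xs, ys))
```

With these definitions, union([c, a], [a, b]) comes back as [a, b, c] regardless of the order of the operands.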


(15) Manipulating Sequences (I)

  • Concatenating sequences
    ((1, 2, 3), (4, 5, 6)) ≡ (1, 2, 3, 4, 5, 6)
  • Reversing sequences
    reverse((1, 2, 3, 4)) ≡ (4, 3, 2, 1)
  • Finding items in sequences
    index-of((1, 2, 3, 1), 1) ≡ (1, 4)
  • Cutting sub-sequences out of sequences
    subsequence((1, 2, 3, 4, 5, 6, 7), 5, 2) ≡ (5, 6)
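
XPath positions are 1-based, which is easy to get wrong when moving between languages. The Python sketch below mirrors the examples on this slide with lists; the function names are chosen to resemble the XPath functions, and subsequence is modelled for integer arguments only.

```python
def concatenate(*seqs):
    """Sequences never nest: concatenation yields one flat sequence."""
    return [item for seq in seqs for item in seq]

def reverse(seq):
    return seq[::-1]

def index_of(seq, target):
    """1-based positions of every item equal to target."""
    return [pos for pos, item in enumerate(seq, start=1) if item == target]

def subsequence(seq, start, length):
    """Items at 1-based positions start up to start + length - 1."""
    return [
        item
        for pos, item in enumerate(seq, start=1)
        if start <= pos < start + length
    ]
```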


(16) Manipulating Sequences (II)

  • Inserting items into sequences
    insert-before(("one", "two", "four"), 3, "three") ≡ ("one", "two", "three", "four")
  • Removing items from sequences
    remove(("white", "white", "black", "white"), 3) ≡ ("white", "white", "white")
  • Removing duplicates from a sequence
    distinct-values((1, 2, 3, 1, 2, 6, 7)) ≡ (1, 2, 3, 6, 7)
  • Help your optimizer!
    unordered((1, 2, 3, 4, 5)) may return the items in any order, e.g. (3, 4, 1, 2, 5)
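
As a rough model of the functions on this slide, the following Python sketch uses lists and 1-based positions. The function names mirror the XPath functions but are otherwise illustrative; distinct_values keeps the first occurrence of each value, which is only one of the orders an XPath processor may choose.

```python
def insert_before(seq, position, inserts):
    """Insert items before the 1-based position; positions outside the
    sequence clamp to the start or the end."""
    index = max(position, 1) - 1
    return seq[:index] + inserts + seq[index:]

def remove(seq, position):
    """Drop the item at the 1-based position; any other position
    leaves the sequence unchanged."""
    return [item for pos, item in enumerate(seq, start=1) if pos != position]

def distinct_values(seq):
    """Keep the first occurrence of each value."""
    seen = []
    for item in seq:
        if item not in seen:
            seen.append(item)
    return seen
```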


(17) Aggregating Sequences

  • Counting the number of items in a sequence
    count((1, 2, 3, 4, 5, 6)) ≡ 6
  • Calculating the average (the types must be compatible)
    avg((1, 2, 3, 4, 5, 6)) ≡ 3.5
  • Getting maximum or minimum values from a sequence (the types must be compatible)
    max($seq) ge min($seq)
  • Calculating the sum of sequence items (the types must be compatible)
    sum(1 to 42) ≡ 903
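
The numbers on this slide can be checked with a few lines of Python. This is only a worked arithmetic sketch; note that the XPath range 1 to 42 includes both ends, whereas Python's range() excludes its stop value.

```python
seq = list(range(1, 43))        # models 1 to 42 (both ends included)
total = sum(seq)                # 903
closed_form = 42 * (42 + 1) // 2  # n(n+1)/2 gives the same value

values = [1, 2, 3, 4, 5, 6]
count = len(values)             # 6
average = sum(values) / count   # 3.5
```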


Working with Values

(19) Type Casting

  • Values often are untyped
    • they may be part of a schema-less document
    • they may be extracted as a substring of some text value
    • XSLT 2.0 allows reading text files (these texts are never typed)
  • For intermediate results typed values may be advantageous
    • certain operations are only possible on typed values
    • code using typed values usually is more robust
  • XPath 2.0 has several ways to handle types and instances
    42 instance of xs:integer
    '2007-02-13' castable as xs:date
    '2007-02-13' cast as xs:date
    if ( $i castable as xs:… ) then $i cast as xs:… else ()
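
The castable-then-cast pattern in the last line can be sketched in Python for the xs:date case. This is a simplified model under stated assumptions: only the plain YYYY-MM-DD form is accepted (time zones and negative years, which xs:date allows, are ignored), and the function names are hypothetical.

```python
import re
from datetime import date

_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")

def cast_as_date(value):
    """Model of `cast as xs:date`: return a date or raise ValueError."""
    match = _DATE.fullmatch(value.strip())
    if match is None:
        raise ValueError("not a YYYY-MM-DD date: %r" % value)
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)  # rejects e.g. month 13 or 30 February

def castable_as_date(value):
    """Model of `castable as xs:date`: test without raising."""
    try:
        cast_as_date(value)
        return True
    except ValueError:
        return False

def cast_or_empty(value):
    """if ( $i castable as ... ) then $i cast as ... else ()"""
    return [cast_as_date(value)] if castable_as_date(value) else []
```

Returning a list keeps the sequence flavour of the original: a failed cast gives the empty sequence instead of an error.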


Readable Syntax

(21) Easier to Understand

<listing src="xlinked-class.xml" line="81-98"/>
string-join(tokenize( if ( exists(@encoding) ) then unparsed-text($fileuri, @encoding) else unparsed-text($fileuri), '\r?\n')[(position() ge number(tokenize(current()/@line, '\-')[1])) and (position() le number(tokenize(current()/@line, '\-')[2]))], '&#xa;')
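
Read from the inside out, the expression splits a text file into lines, keeps the lines named by a from-to range, and joins them again. A Python sketch of the same steps may help when reading it; the function name select_lines is hypothetical, and reading the file (with or without an explicit encoding) is left out.

```python
import re

def select_lines(text, line_range):
    """Return the lines of text named by a 'from-to' range
    (1-based, both ends included), joined by line feeds."""
    # tokenize(@line, '\-') : split the range into its two numbers
    first, last = (int(part) for part in line_range.split("-"))
    # tokenize(..., '\r?\n') : split the text into lines
    lines = re.split(r"\r?\n", text)
    # [position() ge first and position() le last], then string-join
    return "\n".join(lines[first - 1:last])
```

For example, select_lines("a\nb\r\nc\nd", "2-3") keeps the second and third line.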


Conditional Expressions

(23) Control Flow in XPath

  • Control flow in XPath 1.0 expressions is based on predicates
    • the results of location path steps are filtered by predicates
    • this can be used to emulate control flow
    • this technique is limited because it can only be applied to nodes
  • XPath 2.0 introduces conditional expressions
    • a condition is given which is interpreted as a boolean
    • based on the result, either the then or the else part is evaluated
    • the else part may not be omitted
if ( … ) then … else …
if ( @sex eq 'm' ) then 'Sir' else 'Madam'
if ( @sex eq 'm' ) then 'Sir' else if ( @sex eq 'f' ) then 'Madam' else 'Whatever'


(24) Less XSLT

<names>
 <name>
  <first>Erik</first>
  <last>Wilde</last>
 </name>
 <name>
  <last>Hasan</last>
 </name>
</names>
first | last[not(../first)]
<xsl:variable name="name">
	<xsl:choose>
		<xsl:when test="first">
			<xsl:value-of select="first"/>
		</xsl:when>
		<xsl:otherwise>
			<xsl:value-of select="last"/>
		</xsl:otherwise>
	</xsl:choose>
</xsl:variable>
if ( exists(first) ) then first else last


Iterations

(26) Repeating Expression Evaluation

  • Iteration repeatedly applies an expression to a sequence of items
    • the notion of sequences is central to this concept
    • this requires variables for binding and evaluation
  • Iterations clearly demonstrate the change in expressiveness
    • they introduce functionality which previously was limited to host languages
for $… in … return …
for $i in //name return $i/last
for $i in //name return if ( exists($i/first) ) then $i/first else $i/last
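
For readers more used to host languages, the last expression corresponds to a loop with a conditional in its body. The Python sketch below runs it against the names document from the "Less XSLT" slide using xml.etree.ElementTree; the helper name display_name is an assumption.

```python
import xml.etree.ElementTree as ET

NAMES = """<names>
 <name><first>Erik</first><last>Wilde</last></name>
 <name><last>Hasan</last></name>
</names>"""

doc = ET.fromstring(NAMES)

def display_name(name):
    """if ( exists(first) ) then first else last"""
    first = name.find("first")
    # Test against None: an element without children is falsy in ElementTree.
    return first if first is not None else name.find("last")

# for $i in //name return if ( exists($i/first) ) then $i/first else $i/last
selected = [display_name(name) for name in doc.iter("name")]
texts = [node.text for node in selected]  # ['Erik', 'Hasan']
```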


(27) Iterations vs. Location Paths

  • Every location path can be written using iterations
    /names/name/last
    for $i in /names return for $j in $i/name return $j/last
  • Iterations are a more generalized way of evaluation
    • path expressions work on nodes only
      for $i in 1 to 10 return $i
    • path expressions sort by document order and eliminate duplicates
      //last/../..
      for $i in //last return for $j in $i/.. return $j/..
    • location steps change the context, iterations use the variable for this purpose
  • Location paths are a useful syntax and method for tree navigation
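
The difference between //last/../.. and its iteration counterpart shows up as soon as two items lead to the same node. The Python sketch below models both on a small names document; the document, the parent map, and the variable names are assumptions for this illustration, since ElementTree has no parent axis of its own.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<names><name><first>Erik</first><last>Wilde</last></name>"
    "<name><last>Hasan</last></name></names>"
)
parent = {child: node for node in doc.iter() for child in node}
order = {node: i for i, node in enumerate(doc.iter())}

lasts = doc.findall(".//last")

# for $i in //last return for $j in $i/.. return $j/..
# One result per item, so the names element appears twice.
iterated = [parent[parent[last]] for last in lasts]

# //last/../..
# A path expression removes duplicates and sorts by document order.
path_result = sorted(set(iterated), key=order.__getitem__)
```

Here iterated holds the names element twice, while path_result holds it once.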


Quantified Expressions

(29) Testing Sequences

  • Testing whether some or all items of a sequence satisfy a condition
    • the notion of sequences is central to this concept
    • this requires variables for binding and evaluation
  • Quantifiers are well-known from query languages
    • some iterates over items and succeeds after the first success
    • every iterates over items and fails after the first failure
    • both constructs are good candidates for optimization
( some | every ) $… in … satisfies …
some $i in //*[@xlink:type='locator']/@xlink:href satisfies $i eq $query-uri
every $i in //li/@id satisfies //*[@xlink:type='locator'][@xlink:href=concat('#', $i)]
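
The early-exit behavior described above matches Python's any() and all(), which makes for a compact model. In this sketch the names some, every, and is_even are illustrative, and stopping early is shown as one possible evaluation strategy rather than something every processor must do.

```python
def some(items, condition):
    """some $i in items satisfies condition($i)"""
    return any(condition(item) for item in items)

def every(items, condition):
    """every $i in items satisfies condition($i)"""
    return all(condition(item) for item in items)

checked = []  # records which items were actually tested

def is_even(n):
    checked.append(n)
    return n % 2 == 0

found = some([1, 2, 3, 4], is_even)  # True
# checked is now [1, 2]: evaluation stopped at the first success
```

Over an empty sequence, some is false and every is true, because there is no item that could succeed or fail.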


Conclusions

(31) Easy Transition


