Abstract

While XSL Transformations (XSLT) 1.0 has become a successful programming language that is widely used for transforming XML documents, its limitations sometimes make it difficult to use XSLT effectively. Many of these limitations stem from the fact that XSLT 1.0 was designed as a client-side language. Building on XSLT 1.0 and XPath 2.0, XSLT 2.0 improves the language in a variety of ways.

XSLT 1.0 Restrictions

XSLT as a client-side language
- XSLT 1.0 was designed to run in a browser (similar to CSS)
- XSLT today is almost never used as a client-side language
Processing model geared towards client-side usage
- there always is one input document and one output document
- runtime errors have to be avoided as much as possible
Data types and XML
- XML is a very weakly typed language (strings, IDs, IDREFs)
- any application data types must be implemented in application code

XSLT 2.0 Improvements

XSLT as a server-side language
- XSLT 2.0 better supports server-side usage
- native XSLT support in browsers might never happen reliably
- shipping XML and transforming it in the browser is not required very often
Processing model extended to better support server-side usage
- there can be more than one output document
- runtime errors can be a very valuable tool for detecting program errors
XML Schema (XSDL) introduces a datatype model for XML
- Simple Types provide a basic vocabulary of datatypes
- many XPath 2.0 functions support working with the simple types
- Complex Types allow the definition of structured types
- type checking is supported for simple and complex types
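
The value of type checking and runtime errors can be illustrated with a minimal sketch. The element and attribute names (reference, date/@value) are borrowed from the examples later in these slides; that @value starts with a four-digit year is an assumption made here for illustration.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:template match="reference">
    <!-- The cast raises a dynamic error if the first four characters of
         date/@value are not an integer (assumed input format), so bad
         input is reported instead of being copied to the output. -->
    <xsl:variable name="year" as="xs:integer"
        select="xs:integer(substring(date/@value, 1, 4))"/>
    <year><xsl:value-of select="$year"/></year>
  </xsl:template>

</xsl:stylesheet>
```

In XSLT 1.0 the comparable number() call would yield NaN and processing would simply continue, which fits the client-side goal of avoiding runtime errors but hides the problem.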

Outline (Multiple Result Documents)

Multiple Result Documents
1. Creating Identifiers
Text Processing
1. Accessing Text Documents
2. Transforming Text
Conclusions

One XML, Many HTML

The original model of XSLT 1.0 was a 1:1 mapping of XML and HTML
- a browser retrieves an XML document and generates HTML from it
- this assumed that the granularity of XML is the same as for HTML
XML documents often represent complex information
- in many cases this is too much information to be displayed on just one HTML page
- typically the complex model of XML is mapped to interlinked HTML
HTML generated from XML can reflect many different views
- one HTML for each core concept of the XML information model
- indices that make available other HTML through faceted lists
- table of contents using various concepts for listing entries
- alternative representations for core concepts (various dimensions possible)

Generating HTML Pages

  <xsl:for-each select="//reference">
   <xsl:result-document method="xhtml" href="reference/{@name}.html">
    <html>
     <head>
      <title><xsl:value-of select="title"/></title>
      <link rel="stylesheet" type="text/css" href="../reference.css"/>
     </head>
     <body>
      <div class="navigation">
       <a href="{ if ( position() ne 1 ) then preceding-sibling::reference[1]/@name else following-sibling::reference[last()]/@name }.html">Previous</a> |
       <a href="../reference-list.html">Index</a> |
       <a href="{ if ( position() ne last() ) then following-sibling::reference[1]/@name else preceding-sibling::reference[last()]/@name }.html">Next</a>
      </div>
      <h2><xsl:value-of select="title"/></h2>
      <xsl:apply-templates select="names[@type='author']/*"/>
      <h4><a href="../reference-list.html#year{substring(date/@value, 1, 4)}"><xsl:value-of select="myns:format-date(date/@value)"/></a></h4>
      <xsl:if test="abstract">
       <div class="abstract"><xsl:copy-of select="abstract/richtext/*"/></div>
      </xsl:if>
     </body>
    </html>
   </xsl:result-document>
  </xsl:for-each>
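
The stylesheet above calls myns:format-date, which is not shown in these slides. A minimal sketch of how such a function might look follows. The namespace URI, the picture string, and the assumption that date/@value holds an ISO 8601 date (YYYY-MM-DD) in an untyped input document are all hypothetical.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:myns="http://example.org/myns"
    exclude-result-prefixes="xs myns">

  <!-- Hypothetical helper: formats an ISO date string for display and
       falls back to the raw string if it is not a valid xs:date. -->
  <xsl:function name="myns:format-date" as="xs:string">
    <xsl:param name="value" as="xs:string?"/>
    <xsl:sequence select="
      if ($value castable as xs:date)
      then format-date(xs:date($value), '[D] [MNn] [Y]')
      else string($value)"/>
  </xsl:function>

</xsl:stylesheet>
```

With a typical English-language default, a value such as 2007-10-25 would be rendered as day, month name, and year. Declaring the parameter and result types lets the processor report misuse of the function as a type error.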

Outline (Creating Identifiers)

Navigable Hypertext

Create as many hypertext links as possible
- styling should make sure that hyperlink formatting does not degrade legibility
- use different styles for essential links and ancillary links
- ancillary links should not use mystery meat navigation, but something close
Purely generated pages get generated names
- table of contents and listings based on various criteria
- index pages for better access to page contents
Pages representing core concepts should get identifier names
- these identifiers should be stable so that bookmarks do not break
- they can reuse XML identifiers, derive identifiers from content, or generate random identifiers
- a well-defined and stable URI naming policy is important

Reuse Existing Identifiers

Many core concepts in XML documents have identifiers
- are these identifiers mandatory?
- are these identifiers a good choice for HTML page names?
- sometimes simple string functions can help to create better identifiers
URI design is a core part of REST and essential to good Web architecture
True REST design will also allow the creation of new resources
- it is possible to PUT new resources into existing collections
- it is possible to POST new resources to existing collections
- PUT and POST are different with regard to the resource name
These identifiers are a core part of the application data model

Generate Content-Based Identifiers

Sometimes more descriptive identifiers are required
- easier to understand when looking at the identifier and the URI
- often there is a danger of name clashes
Blogs often use a combination of the post date and the title
- dates should appear as hierarchical path segments such as 2007/10/25
- titles are appended by mapping the post title to URI syntax (replace and truncate)
- name clashes can only occur on the same day using a very similar title
- date navigation can be used to provide access to date-based index pages
Generated identifiers should be stable (name clashes should not break them)
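
The blog-style scheme described above (date as path segments, title mapped to URI syntax by replacing and truncating) can be sketched as an XSLT 2.0 function. The function name myns:page-name, the namespace URI, and the 40-character limit are assumptions chosen for illustration; the sketch does not detect name clashes and replaces non-ASCII letters with hyphens.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:myns="http://example.org/myns"
    exclude-result-prefixes="xs myns">

  <!-- Hypothetical helper: builds a path such as 2007/10/25/some-title -->
  <xsl:function name="myns:page-name" as="xs:string">
    <xsl:param name="date" as="xs:date"/>
    <xsl:param name="title" as="xs:string"/>
    <!-- Replace: every run of characters other than a-z and 0-9 becomes a hyphen. -->
    <xsl:variable name="replaced"
        select="replace(lower-case(normalize-space($title)), '[^a-z0-9]+', '-')"/>
    <!-- Truncate to 40 characters, then strip leading and trailing hyphens. -->
    <xsl:variable name="truncated"
        select="replace(substring($replaced, 1, 40), '^-+|-+$', '')"/>
    <xsl:sequence
        select="concat(format-date($date, '[Y0001]/[M01]/[D01]'), '/', $truncated)"/>
  </xsl:function>

</xsl:stylesheet>
```

For example, myns:page-name(xs:date('2007-10-25'), 'XSLT 2.0: Part I') yields 2007/10/25/xslt-2-0-part-i. Because the result depends only on the date and the title, it stays stable across runs as long as neither changes.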

Generate Random Identifiers

Sometimes it may not be required or possible to reuse data for identifiers
- this may be true if there is no identifier and no main property
- generated identifiers can be designed to be very compact
Random identifiers should use some pseudo-random algorithm
- one possible solution is a fingerprint algorithm such as MD5
- another solution is a really random solution such as TinyURL
It is necessary to keep track of the generated identifiers
- collisions are possible (in particular in case of short random values)
- in case of a collision an alternative identifier must be assigned

Using Existing Identifiers

  <xsl:result-document method="xhtml" href="reference-list.html">
   <html>
    <head>
     <title>Reference List</title>
     <link rel="stylesheet" type="text/css" href="reference.css"/>
    </head>
    <body>
     <h2>Reference List</h2>
     <xsl:for-each-group select="//reference" group-by="substring(date/@value, 1, 4)">
      <xsl:sort select="current-grouping-key()"/>
      <h5 id="year{current-grouping-key()}"><xsl:value-of select="current-grouping-key()"/></h5>
      <ul>
       <xsl:for-each select="current-group()">
        <xsl:sort select="title"/>
        <li><a href="reference/{@name}.html"><xsl:value-of select="title"/></a></li>
       </xsl:for-each>
      </ul>
     </xsl:for-each-group>
    </body>
   </html>
  </xsl:result-document>

Outline (Text Processing)

Text Processing in XSLT 1.0

XPath 1.0 provides a small number of string functions
- the selection of functions is very limited and sometimes restrictive
- more advanced functionality is not available (in particular, no Regular Expressions)
Text documents cannot be processed at all in XSLT 1.0
- XSLT 1.0 assumes that valuable input data always is XML
- supporting text input is a straightforward extension of the XSLT processing model
- binary data access would require a much bigger change of the language
XSLT 2.0 extends XSLT to support import and export
- XSLT 1.0 already supports text documents as an output format
- XSLT 2.0 now supports text documents as an input format

Outline (Accessing Text Documents)

Non-XML in an XML World

Many tools produce text-based output
- text structures are much simpler and often lossy
- at least some data can be used and reused
unparsed-text() reads a text-based document
- returns a string containing the complete input document
- optionally, an encoding can be specified (UTF-8 is assumed if no encoding can be determined)
Text documents often also are structured documents
- text uses sentences and paragraphs (empty lines) and maybe other markup
- text formats often use commas, semicolons, spaces, or tabs to delimit structures
- XSLT 2.0's text transformation features support working with these structures
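
As a minimal sketch of the points above, the following stylesheet reads a text file with unparsed-text() and turns paragraphs into elements. The parameter name text-uri, the file name notes.txt, and the element names document and para are hypothetical; paragraphs are assumed to be separated by one or more empty lines.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Hypothetical parameter: URI of the text file to read
       (a relative URI is resolved against the stylesheet). -->
  <xsl:param name="text-uri" as="xs:string" select="'notes.txt'"/>

  <xsl:template name="main">
    <!-- The complete file is returned as a single string. -->
    <xsl:variable name="text" as="xs:string"
        select="unparsed-text($text-uri, 'UTF-8')"/>
    <document>
      <!-- Split at runs of two or more line breaks; drop empty pieces. -->
      <xsl:for-each select="tokenize($text, '(\r?\n)(\r?\n)+')[normalize-space()]">
        <para><xsl:value-of select="normalize-space(.)"/></para>
      </xsl:for-each>
    </document>
  </xsl:template>

</xsl:stylesheet>
```

Since there is no XML input document, the transformation would be started at the named template main (an assumption about how the processor is invoked).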

Comma-Separated Values (CSV)

RFC 4180 defines a textual format for spreadsheet data
CSV has been used for a long time, but implementations have handled some of the details differently
Defining a media type makes it easier for implementations to know what to expect
- the CSV registration not only registers the type, but also defines it
CSV is not overly complex, but some issues have to be solved
- how to separate lines (CRLF)
- how to end the file (CRLF is allowed but optional)
- are headers allowed (yes, but they are not marked as such)
- may different lines use different numbers of fields (no)
- are spaces significant (yes)
- are quotes significant (no, they are delimiters, so quotes as values must be escaped by doubling them)
- how to treat fields with CRLF, commas, or quotes (enclose the value in quotes)
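
The CSV rules above can be sketched in XSLT 2.0 using unparsed-text(), tokenize(), and xsl:analyze-string. This is an illustration under stated assumptions, not a complete RFC 4180 parser: the parameter csv-uri, the file name data.csv, and the element names table, row, and field are hypothetical, and line breaks inside quoted fields are not supported because records are split at line breaks first.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Hypothetical parameter: URI of the CSV file to convert. -->
  <xsl:param name="csv-uri" as="xs:string" select="'data.csv'"/>

  <xsl:template name="main">
    <table>
      <!-- One row per line; the empty token after a final CRLF is dropped. -->
      <xsl:for-each select="tokenize(unparsed-text($csv-uri), '\r?\n')[. ne '']">
        <row>
          <!-- A comma is appended so that every field, including the last
               one, ends with a comma and the regex never matches an empty
               string. Group 2 is the content of a quoted field, group 4
               the content of an unquoted field. -->
          <xsl:analyze-string select="concat(., ',')"
              regex='("(([^"]|"")*)"|([^,"]*)),'>
            <xsl:matching-substring>
              <field>
                <!-- Quoted fields: remove the delimiters and undouble quotes. -->
                <xsl:value-of select="
                  if (starts-with(regex-group(1), '&quot;'))
                  then replace(regex-group(2), '&quot;&quot;', '&quot;')
                  else regex-group(4)"/>
              </field>
            </xsl:matching-substring>
          </xsl:analyze-string>
        </row>
      </xsl:for-each>
    </table>
  </xsl:template>

</xsl:stylesheet>
```

Spaces are kept as they are, and a header line is converted like any other row because CSV does not mark it as such.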

Outline (Transforming Text)

Regular Expressions

XPath is great for navigating through an XML tree
- all relevant structures of XML are represented and can be navigated
- XPath 1.0 has some very basic string functions to work with content
- XPath 2.0 adds many more string functions
XPath 2.0 extends the regular expression syntax of XSDL
- the usual basic expressions known from many languages and tools
- ^ and $ for matching beginnings and ends (of strings or lines)
- XPath 2.0 supports reluctant quantifiers (indicated by a ? following a quantifier)
- allows access to sub-expressions (important for selective replace() of substrings)
- allows back-references within expressions (references captured substrings)
XPath 2.0 supports regular expressions in three functions: matches(), replace(), and tokenize()
- XSLT 2.0 adds an instruction for parsing strings (xsl:analyze-string)

Matching & Replacing

matches() tests whether a string matches a given pattern
- an optional flag allows different processing options

  matches("abracadabra", "bra")    eq true()
  matches("abracadabra", "^a.*a$") eq true()
  matches("abracadabra", "^bra")   eq false()

replace() selectively replaces parts of the input string
- supports the same flag as the matches() function

  replace("abracadabra", "bra", "*")              eq "a*cada*"
  replace("abracadabra", "a.*a", "*")             eq "*"
  replace("abracadabra", "a.*?a", "*")            eq "*c*bra"
  replace("abracadabra", "a(.)", "a$1$1")         eq "abbraccaddabbra"
  replace("abracadabra", "^(.*?)b(.*)$", "$1c$2") eq "acracadabra"
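
The optional flag mentioned above is a string of letters. The following sketch shows the effect of the i (case-insensitive), m (multi-line), and s (dot-all) flags on a two-line test string; the test string and the named template main are assumptions made for the example.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template name="main">
    <!-- Two lines, "Abra" and "cadabra", separated by a newline (code point 10). -->
    <xsl:variable name="text"
        select="concat('Abra', codepoints-to-string(10), 'cadabra')"/>
    <xsl:value-of separator=", " select="(
      matches($text, '^abra'),       (: false: the string starts with an upper-case A :)
      matches($text, '^abra', 'i'),  (: true: i ignores case :)
      matches($text, '^cad'),        (: false: ^ only matches the start of the string :)
      matches($text, '^cad', 'm'),   (: true: m lets ^ match at the start of each line :)
      matches($text, 'a.c'),         (: false: the dot does not match a newline :)
      matches($text, 'a.c', 's')     (: true: s lets the dot match a newline :)
      )"/>
  </xsl:template>

</xsl:stylesheet>
```

Flags can be combined in one string, for example 'im', and the same flag strings are accepted by replace() and tokenize().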

Tokenizing

tokenize() turns a string into a sequence of strings
- supports the transition from text structures to the XDM concept of sequences
- supports the same flag as the matches() and replace() functions

Tokenization is based on the concept of pattern-based structures
- input strings use some recognizable way of separating substrings
- a pattern is used to find the separators, and the substrings between them are returned as a sequence

  tokenize("just plain  text", "\s+") returns ( "just", "plain", "text" )
  tokenize("1,15,,24,50,", ",")       returns ( "1", "15", "", "24", "50", "" )
  tokenize("HTML <BR> tag<br />soup", "\s*<br\s*/?>\s*", "i") returns ( "HTML", "tag", "soup" )

Analyzing Strings

XPath functions work on a string and return strings or sequences
xsl:analyze-string executes XSLT code for parts of the string
- XSLT code can create elements and/or attributes based on string input
- transforming text into XML often is referred to as up-conversion
two child elements contain code for handling the parsing process
- xsl:matching-substring is executed for each matching part
- xsl:non-matching-substring is executed for each non-matching part
- each of these elements is optional, but at least one of them must be present
- if two adjacent matching substrings are found, xsl:matching-substring is executed twice

Replacing Characters with Elements

Replace all newline characters in the abstract element with <br/> elements

<xsl:analyze-string select="abstract" regex="\n">
  <xsl:matching-substring>
    <br/>
  </xsl:matching-substring>
  <xsl:non-matching-substring>
    <xsl:value-of select="."/>
  </xsl:non-matching-substring>
</xsl:analyze-string>

Replacing Character Markup

Turn textual conventions into XML markup
- citations use […] to mark the citation identifier

<xsl:analyze-string select="body" regex="\[(.*?)\]">
  <xsl:matching-substring>
    <cite><xsl:value-of select="regex-group(1)"/></cite>
  </xsl:matching-substring>
  <xsl:non-matching-substring>
    <xsl:value-of select="."/>
  </xsl:non-matching-substring>
</xsl:analyze-string>

Outline (Conclusions)

A Better XSLT

Multiple result documents can generate Web sites from one XML document
Highly interlinked hypertext can be produced by adding HTML links
Text processing opens a new possibility for XSLT processing
Regular expression support allows flexible processing of text documents
XPath 2.0 and XSLT 2.0 support pattern-based text processing

XML Transformations (XSLT) 2.0 – Part I

XML Foundations (INFO 242)

Erik Wilde, UC Berkeley School of Information
2007-10-16
