XML Transformations (XSLT) 2.0 – Part I

XML Foundations [./]
Fall 2009 — INFO 242 (CCN 42575)

Erik Wilde, UC Berkeley School of Information
2009-10-20

This work is licensed under a Creative Commons Attribution 3.0 Unported License [http://creativecommons.org/licenses/by/3.0/]


(2) Abstract

While XML Transformations (XSLT) 1.0 has become a successful programming language that is widely used for transforming XML documents, its limitations sometimes make it difficult to use XSLT well. An important reason for many of these limitations is that XSLT 1.0 was designed as a client-side language. Building on XSLT 1.0 and XPath 2.0, XML Transformations (XSLT) 2.0 improves the language in a variety of ways.




(3) XSLT 1.0 Restrictions




(4) XSLT 2.0 Improvements



Multiple Result Documents

Outline (Multiple Result Documents)

  1. Multiple Result Documents [7]
    1. Creating Identifiers [5]
  2. Text Processing [9]
    1. Accessing Text Documents [2]
    2. Transforming Text [6]
  3. Conclusions [1]

(6) One XML, Many HTML




(7) Generating HTML Pages

  <xsl:for-each select="//reference">
   <xsl:result-document method="xhtml" href="reference/{@name}.html">
    <html>
     <head>
      <title><xsl:value-of select="title"/></title>
      <link rel="stylesheet" type="text/css" href="../reference.css"/>
     </head>
     <body>
      <div class="navigation">
       <a href="{ if ( position() ne 1 ) then preceding-sibling::reference[1]/@name else following-sibling::reference[last()]/@name }.html">Previous</a> |
       <a href="../reference-list.html">Index</a> |
       <a href="{ if ( position() ne last() ) then following-sibling::reference[1]/@name else preceding-sibling::reference[last()]/@name }.html">Next</a>
      </div>
      <h2><xsl:value-of select="title"/></h2>
      <xsl:apply-templates select="names[@type='author']/*"/>
      <h4><a href="../reference-list.html#year{substring(date/@value, 1, 4)}"><xsl:value-of select="myns:format-date(date/@value)"/></a></h4>
      <xsl:if test="abstract">
       <div class="abstract"><xsl:copy-of select="abstract/richtext/*"/></div>
      </xsl:if>
     </body>
    </html>
   </xsl:result-document>
  </xsl:for-each>


Creating Identifiers


(9) Navigable Hypertext

  • Create as many hypertext links as possible
    • styling should make sure that hyperlink formatting does not degrade legibility
    • use different styles for essential links and ancillary links
    • ancillary links should not use mystery meat navigation [http://www.webpagesthatsuck.com/mysterymeatnavigation00.html], but something close
  • Purely generated pages get generated names
    • table of contents and listings based on various criteria
    • index pages for better access to page contents
  • Pages representing core concepts should get identifier names



(10) Reuse Existing Identifiers

  • Many core concepts in XML documents have identifiers
    • are these identifiers mandatory?
    • are these identifiers a good choice for HTML page names?
    • sometimes simple string functions can help to create better identifiers
  • URI design is a core part of REST [../web-fall09/rest] and essential to good Web architecture
  • True REST design will also allow the creation of new resources
    • it is possible to PUT new resources into existing collections
    • it is possible to POST new resources to existing collections
    • PUT and POST are different with regard to the resource name
  • These identifiers are a core part of the application data model



(11) Generate Content-Based Identifiers

  • Sometimes more descriptive (self-explanatory) identifiers are required
    • easier to understand when looking at the identifier and the URI
    • often there is a danger of name clashes
  • Blogs often use a combination of the post date and the title
    • dates should appear as hierarchical path segments such as 2007/10/25
    • titles are appended after adapting the post title to URI syntax (replace and truncate)
    • name clashes can only occur on the same day using a very similar title
    • date navigation can be used to provide access to date-based index pages
  • Generated identifiers should be stable (name clashes should not break them)
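
The date-plus-title scheme described above could be sketched as an XSLT 2.0 stylesheet function. This is only an illustration under assumptions: the function name myns:post-path, the namespace URI bound to the myns prefix, the ISO date format of the first argument, and the 40-character limit are all hypothetical and not taken from the course material.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:myns="http://example.org/myns"
    exclude-result-prefixes="xs myns">

  <!-- Hypothetical helper: turns an ISO date and a title into a path
       such as 2007/10/25/some-post-title. -->
  <xsl:function name="myns:post-path" as="xs:string">
    <xsl:param name="date" as="xs:string"/>   <!-- assumed form: 2007-10-25 -->
    <xsl:param name="title" as="xs:string"/>
    <!-- replace: lower-case the title and collapse every run of characters
         other than a-z and 0-9 into a single hyphen -->
    <xsl:variable name="slug"
        select="replace(lower-case($title), '[^a-z0-9]+', '-')"/>
    <!-- truncate: keep at most 40 characters, then strip hyphens at both ends -->
    <xsl:variable name="short"
        select="replace(substring($slug, 1, 40), '^-+|-+$', '')"/>
    <!-- dates become hierarchical path segments -->
    <xsl:sequence select="concat(translate($date, '-', '/'), '/', $short)"/>
  </xsl:function>

</xsl:stylesheet>
```

With these assumptions, myns:post-path('2007-10-25', 'XSLT 2.0: Part I') yields "2007/10/25/xslt-2-0-part-i". The sketch does not detect name clashes, and non-ASCII letters in titles are simply collapsed into hyphens.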



(12) Generate Random Identifiers

  • Sometimes it may not be required or possible to reuse data for identifiers
    • this may be true if there is no identifier and no main property
    • generated identifiers can be designed to be very compact
  • Random identifiers should be produced by some pseudo-random algorithm
    • one possible solution is a fingerprint algorithm such as MD5 [http://www.miraclesalad.com/webtools/md5.php]
    • another solution is a truly random identifier, as used by TinyURL [http://tinyurl.com/] or is.gd [http://is.gd/]
  • It is necessary to keep track of the generated identifiers
    • collisions are possible (in particular in case of short random values)
    • in case of a collision an alternative identifier must be assigned



(13) Using Existing Identifiers

  <xsl:result-document method="xhtml" href="reference-list.html">
   <html>
    <head>
     <title>Reference List</title>
     <link rel="stylesheet" type="text/css" href="reference.css"/>
    </head>
    <body>
     <h2>Reference List</h2>
     <xsl:for-each-group select="//reference" group-by="substring(date/@value, 1, 4)">
      <xsl:sort select="current-grouping-key()"/>
      <h5 id="year{current-grouping-key()}"><xsl:value-of select="current-grouping-key()"/></h5>
      <ul>
       <xsl:for-each select="current-group()">
        <xsl:sort select="title"/>
        <li><a href="reference/{@name}.html"><xsl:value-of select="title"/></a></li>
       </xsl:for-each>
      </ul>
     </xsl:for-each-group>
    </body>
   </html>
  </xsl:result-document>


Text Processing


(15) Text Processing in XSLT 1.0



Accessing Text Documents


(17) Non-XML in an XML World

  • Many tools produce text-based output
    • text structures are much simpler and often lossy
    • at least some data can be used and reused
  • unparsed-text() reads a text-based document
    • returns a string containing the complete input document
    • optionally, an encoding can be specified (UTF-8 is assumed if no encoding can be determined)
  • Text documents often also are structured documents
    • text uses sentences and paragraphs (empty lines) and maybe other markup
    • text formats often use commas, semicolons, spaces, or tabs to structure data
    • XSLT 2.0's text transformation features [Transforming Text (1)] support working with these structures
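
As a minimal sketch of reading a text document, the following stylesheet turns paragraphs separated by empty lines into para elements. The file name notes.txt, the parameter name input, the named template main, and the result element names are assumptions for illustration; a relative URI is resolved against the base URI of the stylesheet.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: notes.txt is a hypothetical UTF-8 text file. -->
  <xsl:param name="input" select="'notes.txt'"/>

  <xsl:template name="main">
    <document>
      <!-- only read the file if it can be retrieved and decoded -->
      <xsl:if test="unparsed-text-available($input, 'UTF-8')">
        <!-- split on two or more consecutive line breaks (LF or CRLF)
             and skip chunks that contain only whitespace -->
        <xsl:for-each select="tokenize(unparsed-text($input, 'UTF-8'),
                                       '(\r?\n){2,}')[normalize-space(.)]">
          <para><xsl:value-of select="normalize-space(.)"/></para>
        </xsl:for-each>
      </xsl:if>
    </document>
  </xsl:template>

</xsl:stylesheet>
```

Because no source document is needed, such a stylesheet would typically be started at the named template, for example with an initial-template option of the XSLT processor.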



(18) Comma-Separated Values (CSV)

  • RFC 4180 [http://dret.net/rfc-index/reference/RFC4180] defines a textual format for spreadsheet data
  • CSV has been used for a long time, but implementations handled some of the details differently
  • Defining a media type makes it easier for implementations to know what to expect
    • the CSV registration not only registers the type, but also defines it
  • CSV is not overly complex, but some issues have to be solved
    • how to separate lines (CRLF)
    • how to end the file (CRLF is allowed but optional)
    • are headers allowed (yes, but they are not marked as such)
    • may different lines use different numbers of fields (no)
    • are spaces significant (yes)
    • are quotes significant (no, they are delimiters, so quotes as values must be escaped by doubling them)
    • how to treat fields with CRLF, commas, or quotes (enclose the value in quotes)
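
A deliberately simplified CSV reader can be sketched with unparsed-text() and tokenize(). It assumes a hypothetical file data.csv and made-up result element names, and it does not handle quoted fields, so values containing commas, quotes, or line breaks would be split incorrectly.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: data.csv is a hypothetical CSV file without quoted fields. -->
  <xsl:param name="csv-uri" select="'data.csv'"/>

  <xsl:template name="main">
    <table>
      <!-- split into lines on CRLF (a bare LF is tolerated); the filter
           drops empty lines, including the empty token after a final CRLF -->
      <xsl:for-each select="tokenize(unparsed-text($csv-uri), '\r?\n')[. ne '']">
        <row>
          <!-- spaces are significant, so fields are not trimmed -->
          <xsl:for-each select="tokenize(., ',')">
            <field><xsl:value-of select="."/></field>
          </xsl:for-each>
        </row>
      </xsl:for-each>
    </table>
  </xsl:template>

</xsl:stylesheet>
```

A header line, if present, simply becomes the first row, since CSV does not mark headers as such.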


Transforming Text


(20) Regular Expressions

  • XPath [XML Path Language (XPath)] is great for navigating through an XML tree
  • XPath 2.0 extends the regular expression syntax [http://www.w3.org/TR/xmlschema-2/#regexs] of XSD [XSD – Part I]
    • the usual basic expressions known from many languages and tools
    • ^ and $ for matching beginnings and ends (of strings or lines)
    • XPath 2.0 supports reluctant quantifiers (indicated by a ? following a quantifier)
    • allows access to sub-expressions (important for selective replace() of substrings)
    • allows back-references within expressions (references captured substrings)
  • XPath 2.0 supports regular expressions in three functions: matches(), replace(), and tokenize()
    • XSLT 2.0 adds an instruction for parsing strings (xsl:analyze-string)
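
The features listed above (flags, reluctant quantifiers, sub-expressions, and back-references) can be tried out with a small stylesheet. The input strings and the named template main are made up for illustration; the results given in the XPath comments follow from the XPath 2.0 regular expression rules.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template name="main">
    <!-- each item of the sequence is written on its own line -->
    <xsl:value-of separator="&#10;" select="(
      matches('XSLT', '^xslt$', 'i'),           (: true, case-insensitive :)
      matches('one&#10;two', '^two$'),          (: false, ^ and $ refer to the string :)
      matches('one&#10;two', '^two$', 'm'),     (: true, ^ and $ refer to lines :)
      matches('abcabc', '^(abc)\1$'),           (: true, back-reference :)
      replace('a [x] b [y]', '\[.*\]', '#'),    (: 'a #', greedy :)
      replace('a [x] b [y]', '\[.*?\]', '#'),   (: 'a # b #', reluctant :)
      replace('2009-10-20', '(\d{4})-(\d{2})-(\d{2})', '$3.$2.$1')
                                                (: '20.10.2009', sub-expressions :)
      )"/>
  </xsl:template>

</xsl:stylesheet>
```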



(21) Matching & Replacing

  • matches() tests whether a string matches a given pattern
    • an optional flags argument selects different processing options [http://www.w3.org/TR/xpath-functions/#flags]
      matches("abracadabra", "bra")    ≡ true()
      matches("abracadabra", "^a.*a$") ≡ true()
      matches("abracadabra", "^bra")   ≡ false()
  • replace() selectively replaces parts of the input string
    • supports the same flags as the matches() function
      replace("abracadabra", "bra", "*")              ≡ "a*cada*"
      replace("abracadabra", "a.*a", "*")             ≡ "*"
      replace("abracadabra", "a.*?a", "*")            ≡ "*c*bra"
      replace("abracadabra", "a(.)", "a$1$1")         ≡ "abbraccaddabbra"
      replace("abracadabra", "^(.*?)b(.*)$", "$1c$2") ≡ "acracadabra"



(22) Tokenizing

  • tokenize() turns a string into a sequence of strings
    • supports the transition from text structures to the XDM concept of sequences
    • supports the same flags as the matches() and replace() functions
  • Tokenization is based on the concept of pattern-based structures
    • input strings use some recognizable way of separating substrings
    • a pattern can be used to find substrings and return them as a sequence
      tokenize("just plain  text", "\s+") ≡ ( "just", "plain", "text" )
      tokenize("1,15,,24,50,", ",")       ≡ ( "1", "15", "", "24", "50", "" )
      tokenize("HTML <BR> tag<br />soup", "\s*<br\s*/?>\s*", "i") ≡ ("HTML", "tag", "soup")



(23) Analyzing Strings

  • XPath functions work on a string and return strings or sequences
  • xsl:analyze-string executes XSLT code for parts of the string
    • XSLT code can create elements and/or attributes based on string input
    • transforming text into XML is often referred to as up-conversion
  • Two child elements contain code for handling the parsing process
    • xsl:matching-substring is evaluated for each matching part
    • xsl:non-matching-substring is evaluated for each non-matching part
    • each of these elements is optional, but at least one of them must be present
    • if two adjacent matching substrings are found, xsl:matching-substring is evaluated twice



(24) Replacing Characters with Elements

  • Replace all newline characters in the abstract element with br elements
<xsl:analyze-string select="abstract" regex="\n">
  <xsl:matching-substring>
    <br/>
  </xsl:matching-substring>
  <xsl:non-matching-substring>
    <xsl:value-of select="."/>
  </xsl:non-matching-substring>
</xsl:analyze-string>



(25) Replacing Character Markup

  • Turn textual conventions into XML markup
    • citations use […] to mark the citation identifier
<xsl:analyze-string select="body" regex="\[(.*?)\]">
  <xsl:matching-substring>
    <cite><xsl:value-of select="regex-group(1)"/></cite>
  </xsl:matching-substring>
  <xsl:non-matching-substring>
    <xsl:value-of select="."/>
  </xsl:non-matching-substring>
</xsl:analyze-string>
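
One detail worth a separate sketch: the regex attribute of xsl:analyze-string is an attribute value template, so curly braces that belong to the regular expression have to be doubled. The example below marks up ISO dates in a body element; the date element and its attribute names are hypothetical.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="body">
    <body>
      <!-- {{4}} and {{2}} are the regex quantifiers {4} and {2},
           doubled because regex is an attribute value template -->
      <xsl:analyze-string select="."
          regex="(\d{{4}})-(\d{{2}})-(\d{{2}})">
        <xsl:matching-substring>
          <!-- regex-group(n) returns the n-th captured sub-expression -->
          <date year="{regex-group(1)}" month="{regex-group(2)}"
                day="{regex-group(3)}">
            <xsl:value-of select="."/>
          </date>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </body>
  </xsl:template>

</xsl:stylesheet>
```

The same quantifiers are written with single braces when the pattern is passed to matches(), replace(), or tokenize() inside a select attribute, because select is not an attribute value template.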


Conclusions


(27) A Better XSLT


