Multiple Result Documents

Outline (Multiple Result Documents)

Multiple Result Documents
1. Creating Identifiers
Text Processing
1. Accessing Text Documents
2. Transforming Text
Conclusions

Multiple Result Documents E. Wilde: XML Transformations (XSLT) 2.0 – Part I

(6) One XML, Many HTML

The original model of XSLT 1.0 was a 1:1 mapping from XML to HTML
- a browser retrieves an XML document and generates HTML from it
- this assumes that the granularity of the XML is the same as that of the HTML
XML documents often represent complex information
- in many cases this is too much information to be displayed on just one HTML page
- typically the complex model of XML is mapped to interlinked HTML pages
HTML generated from XML can reflect many different views
- one HTML page for each core concept of the XML information model
- indices that provide access to other HTML pages through faceted lists
- tables of contents using various concepts for listing entries
- alternative representations for core concepts (various dimensions possible)

(7) Generating HTML Pages

  <xsl:for-each select="//reference">
   <xsl:result-document method="xhtml" href="reference/{@name}.html">
    <html>
     <head>
      <title><xsl:value-of select="title"/></title>
      <link rel="stylesheet" type="text/css" href="../reference.css"/>
     </head>
     <body>
      <div class="navigation">
       <a href="{ if ( position() ne 1 ) then preceding-sibling::reference[1]/@name else following-sibling::reference[last()]/@name }.html">Previous</a> |
       <a href="../reference-list.html">Index</a> |
       <a href="{ if ( position() ne last() ) then following-sibling::reference[1]/@name else preceding-sibling::reference[last()]/@name }.html">Next</a>
      </div>
      <h2><xsl:value-of select="title"/></h2>
      <xsl:apply-templates select="names[@type='author']/*"/>
      <h4><a href="../reference-list.html#year{substring(date/@value, 1, 4)}"><xsl:value-of select="myns:format-date(date/@value)"/></a></h4>
      <xsl:if test="abstract">
       <div class="abstract"><xsl:copy-of select="abstract/richtext/*"/></div>
      </xsl:if>
     </body>
    </html>
   </xsl:result-document>
  </xsl:for-each>

references.xsl (line 4-26)

Creating Identifiers

(9) Navigable Hypertext

Create as many hypertext links as possible
- styling should make sure that hyperlink formatting does not degrade legibility
- use different styles for essential links and ancillary links
- ancillary links should not use mystery meat navigation [http://www.webpagesthatsuck.com/mysterymeatnavigation00.html], but something close
Purely generated pages get generated names
- table of contents and listings based on various criteria
- index pages for better access to page contents
Pages representing core concepts should get identifier names
- these identifiers should be stable so that bookmarks do not break
- they can reuse XML identifiers [Reuse Existing Identifiers (1)], derive identifiers from content [Generate Content-Based Identifiers (1)], or generate random identifiers [Generate Random Identifiers (1)]
- a well-defined and stable URI naming policy is important

(10) Reuse Existing Identifiers

Many core concepts in XML documents have identifiers
- are these identifiers mandatory?
- are these identifiers a good choice for HTML page names?
- sometimes simple string functions can help to create better identifiers
URI design is a core part of REST [../web-fall09/rest] and essential to good Web architecture
True REST design will also allow the creation of new resources
- it is possible to PUT new resources into existing collections
- it is possible to POST new resources to existing collections
- PUT and POST are different with regard to the resource name
These identifiers are a core part of the application data model
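
As a minimal sketch of how simple string functions can tidy up an existing identifier, the following stylesheet lower-cases a name attribute and collapses every run of characters other than ASCII letters and digits into a single hyphen. The function name myns:page-name and its namespace URI are hypothetical; the reference element and its name attribute follow the examples in these slides.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:myns="http://example.org/myns"
    exclude-result-prefixes="xs myns">

  <!-- Hypothetical helper: derive a URI-friendly page name from an existing identifier -->
  <xsl:function name="myns:page-name" as="xs:string">
    <xsl:param name="id" as="xs:string"/>
    <!-- lower-case, turn runs of other characters into '-', then trim '-' at both ends -->
    <xsl:sequence select="replace(replace(lower-case($id), '[^a-z0-9]+', '-'), '^-|-$', '')"/>
  </xsl:function>

  <!-- Assumes every reference has a name attribute; a missing one is a type error -->
  <xsl:template match="reference">
    <a href="reference/{myns:page-name(@name)}.html"><xsl:value-of select="title"/></a>
  </xsl:template>

</xsl:stylesheet>
```

With this sketch, an identifier such as "Wilde 2009/a" becomes "wilde-2009-a". Because different identifiers can now map to the same page name, such a cleanup is only safe if the resulting names are checked for clashes.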

(11) Generate Content-Based Identifiers

Sometimes more descriptive ("speaking") identifiers are required
- easier to understand when looking at the identifier and the URI
- often there is a danger of name clashes
Blogs often use a combination of the post date and the title
- dates should appear as hierarchical path segments such as 2007/10/25
- titles are appended by mapping the post title to URI syntax (replace and truncate)
- name clashes can only occur on the same day using a very similar title
- date navigation can be used to provide access to date-based index pages
Generated identifiers should be stable (name clashes should not break them)
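
The blog-style scheme described above could be sketched as follows. The function name myns:post-path, the post element with its date attribute and title child, and the 40-character limit are all assumptions made for illustration; the sketch also only keeps ASCII letters and digits, so titles in other scripts would need a different mapping.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:myns="http://example.org/myns"
    exclude-result-prefixes="xs myns">

  <!-- Hypothetical helper: build a path such as 2007/10/25/some-title -->
  <xsl:function name="myns:post-path" as="xs:string">
    <xsl:param name="date" as="xs:string"/>   <!-- ISO date such as 2007-10-25 -->
    <xsl:param name="title" as="xs:string"/>
    <!-- replace: every run of characters other than a-z and 0-9 becomes one hyphen -->
    <xsl:variable name="slug" select="replace(lower-case($title), '[^a-z0-9]+', '-')"/>
    <!-- truncate: keep at most 40 characters, then trim hyphens at both ends -->
    <xsl:variable name="short" select="replace(substring($slug, 1, 40), '^-|-$', '')"/>
    <!-- the date becomes hierarchical path segments -->
    <xsl:sequence select="concat(translate($date, '-', '/'), '/', $short)"/>
  </xsl:function>

  <!-- Hypothetical input: <post date="2007-10-25"><title>...</title></post> -->
  <xsl:template match="post">
    <a href="{myns:post-path(@date, title)}.html"><xsl:value-of select="title"/></a>
  </xsl:template>

</xsl:stylesheet>
```

For the date 2007-10-25 and the title "One XML, Many HTML!" this yields 2007/10/25/one-xml-many-html. Two posts on the same day whose titles differ only in punctuation or after the 40th character would clash, which is the case the slide warns about.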

(12) Generate Random Identifiers

Sometimes it may not be required or possible to reuse data for identifiers
- this may be true if there is no identifier and no main property
- generated identifiers can be designed to be very compact
Random identifiers should use some pseudo-random algorithm
- one possible solution is a fingerprint algorithm such as MD5 [http://www.miraclesalad.com/webtools/md5.php]
- another solution is a truly random scheme such as TinyURL [http://tinyurl.com/] or is.gd [http://is.gd/]
It is necessary to keep track of the generated identifiers
- collisions are possible (in particular in case of short random values)
- in case of a collision an alternative identifier must be assigned
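
XSLT 2.0 itself offers neither a random number generator nor an MD5 function, so the following sketch only covers the bookkeeping part: it assumes that short identifiers were generated elsewhere and stored in a hypothetical short-id attribute, and it assigns an alternative identifier whenever two references share the same value. The output vocabulary (identifiers, identifier) is made up for this example.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <identifiers>
      <!-- group all references by their externally generated short identifier -->
      <xsl:for-each-group select="//reference" group-by="string(@short-id)">
        <xsl:variable name="key" select="current-grouping-key()"/>
        <xsl:for-each select="current-group()">
          <!-- the first member keeps the identifier, later members get a numeric suffix -->
          <identifier for="{@name}">
            <xsl:value-of select="if (position() eq 1) then $key
                                  else concat($key, '-', position())"/>
          </identifier>
        </xsl:for-each>
      </xsl:for-each-group>
    </identifiers>
  </xsl:template>

</xsl:stylesheet>
```

The suffix scheme assumes that generated identifiers never contain a hyphen themselves. To keep identifiers stable across runs, the resulting mapping would have to be stored and reused rather than recomputed.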

(13) Using Existing Identifiers

  <xsl:result-document method="xhtml" href="reference-list.html">
   <html>
    <head>
     <title>Reference List</title>
     <link rel="stylesheet" type="text/css" href="reference.css"/>
    </head>
    <body>
     <h2>Reference List</h2>
     <xsl:for-each-group select="//reference" group-by="substring(date/@value, 1, 4)">
      <xsl:sort select="current-grouping-key()"/>
      <h5 id="year{current-grouping-key()}"><xsl:value-of select="current-grouping-key()"/></h5>
      <ul>
       <xsl:for-each select="current-group()">
        <xsl:sort select="title"/>
        <li><a href="reference/{@name}.html"><xsl:value-of select="title"/></a></li>
       </xsl:for-each>
      </ul>
     </xsl:for-each-group>
    </body>
   </html>
  </xsl:result-document>

references.xsl (line 27-47)

Text Processing

(15) Text Processing in XSLT 1.0

XPath 1.0 provides a small number of string functions [http://www.w3.org/TR/xpath#section-String-Functions]
- the selection of functions is very limited and sometimes restrictive
- more advanced functionality is not available (in particular, no Regular Expressions [Regular Expressions (1)])
Text documents cannot be processed at all in XSLT 1.0
- XSLT 1.0 assumes that valuable input data is always XML
- supporting text input is a straightforward extension of the XSLT processing model
- binary data access would require a much bigger change of the language
XSLT 2.0 extends XSLT to support both import and export of text
- XSLT 1.0 already supports text documents as an output format
- XSLT 2.0 now also supports text documents as an input format

Accessing Text Documents

(17) Non-XML in an XML World

Many tools produce text-based output
- text structures are much simpler and often lossy
- at least some data can be used and reused
unparsed-text() reads a text-based document
- returns a string containing the complete input document
- optionally, an encoding can be specified (UTF-8 is assumed if no other encoding information is available)
Text documents often are structured documents, too
- text uses sentences and paragraphs (empty lines) and maybe other markup
- text formats often use commas, semicolons, spaces, or tabs for structures
- XSLT 2.0's text transformation features [Transforming Text (1)] support working with these structures
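
A minimal sketch of reading a text document and turning its line structure into elements could look like this. The parameter name input, the file name notes.txt, the chosen encoding, and the lines/line output vocabulary are assumptions for illustration; a relative URI is resolved against the base URI of the stylesheet.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">
  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical parameter: URI of the text document to read -->
  <xsl:param name="input" as="xs:string" select="'notes.txt'"/>

  <!-- The XML source document is not used; any well-formed input will do -->
  <xsl:template match="/">
    <lines>
      <xsl:choose>
        <!-- check first, because unparsed-text() raises an error for unreadable resources -->
        <xsl:when test="unparsed-text-available($input, 'iso-8859-1')">
          <!-- split the complete document string at LF or CRLF line ends -->
          <xsl:for-each select="tokenize(unparsed-text($input, 'iso-8859-1'), '\r?\n')">
            <line number="{position()}"><xsl:value-of select="."/></line>
          </xsl:for-each>
        </xsl:when>
        <xsl:otherwise>
          <xsl:comment>input could not be read</xsl:comment>
        </xsl:otherwise>
      </xsl:choose>
    </lines>
  </xsl:template>

</xsl:stylesheet>
```

If the file ends with a line break, the last line element is empty, because tokenize() returns a zero-length string after a trailing separator.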

(18) Comma-Separated Values (CSV)

RFC 4180 [http://dret.net/rfc-index/reference/RFC4180] defines a textual format for spreadsheet data
CSV has been used for a long time, but implementations handled some of the details differently
Defining a media type makes it easier for implementations to know what to expect
- the CSV registration not only registers the type, but also defines it
CSV is not overly complex, but some issues have to be solved
- how to separate lines (CRLF)
- how to end the file (CRLF is allowed but optional)
- are headers allowed (yes, but they are not marked as such)
- may different lines use different numbers of fields (no)
- are spaces significant (yes)
- are quotes significant (no, they are delimiters, so quotes as values must be escaped)
- how to treat fields with CRLF, commas, or quotes (enclose the value in quotes)
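
For CSV data that does not use quoted fields, the rules above can be sketched with unparsed-text() and two nested tokenize() calls. This is deliberately not a complete RFC 4180 parser: fields containing commas, quotes, or line breaks would be split incorrectly. The parameter name csv-uri, the file name data.csv, and the table/row/field output vocabulary are assumptions.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">
  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical parameter: URI of the CSV file, relative to the stylesheet -->
  <xsl:param name="csv-uri" as="xs:string" select="'data.csv'"/>

  <xsl:template match="/">
    <!-- split into lines and drop empty ones (for example after a final CRLF) -->
    <xsl:variable name="lines"
                  select="tokenize(unparsed-text($csv-uri), '\r?\n')[. ne '']"/>
    <table>
      <xsl:for-each select="$lines">
        <row>
          <!-- spaces are significant in CSV, so fields are not trimmed -->
          <xsl:for-each select="tokenize(., ',')">
            <field><xsl:value-of select="."/></field>
          </xsl:for-each>
        </row>
      </xsl:for-each>
    </table>
  </xsl:template>

</xsl:stylesheet>
```

Since headers are not marked as such in CSV, the sketch treats the first line like any other row; whether it contains field names has to be known from context.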

Transforming Text

(20) Regular Expressions

XPath [XML Path Language (XPath)] is great for navigating through an XML tree
- all relevant structures of XML are represented and can be navigated
- XPath 1.0 [XML Path Language (XPath)] has some very basic string functions [http://www.w3.org/TR/xpath#section-String-Functions] to work with content
- XPath 2.0 [XML Path Language (XPath) 2.0] adds many more string functions [http://www.w3.org/TR/xpath-functions/#string-functions]
XPath 2.0 extends the regular expression syntax [http://www.w3.org/TR/xmlschema-2/#regexs] of XSD [XSD – Part I]
- the usual basic expressions known from many languages and tools
- ^ and $ for matching beginnings and ends (of strings or lines)
- XPath 2.0 supports reluctant quantifiers (indicated by a ? following a quantifier)
- allows access to sub-expressions (important for selective replace() of substrings)
- allows back-references within expressions (references captured substrings)
XPath 2.0 supports regular expressions in three functions: matches(), replace(), and tokenize()
- XSLT 2.0 adds an instruction for parsing strings, xsl:analyze-string [xslt20-string-analyzation]
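
The following stylesheet is a small sketch of three of the regular expression features listed above: back-references, reluctant quantifiers, and line-based anchors via the "m" flag. The element names in the output are made up for the example, and the newline in the test string is written as the character reference &#10; because XPath string literals have no escape syntax of their own.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <regex-demo>
      <!-- back-reference: \1 must repeat the character captured by group 1 -->
      <double-letter><xsl:value-of select="matches('bookkeeper', '(.)\1')"/></double-letter>
      <!-- greedy: a single match spans both pairs of brackets -->
      <greedy><xsl:value-of select="replace('[a][b]', '\[.*\]', 'X')"/></greedy>
      <!-- reluctant: each pair of brackets is matched on its own -->
      <reluctant><xsl:value-of select="replace('[a][b]', '\[.*?\]', 'X')"/></reluctant>
      <!-- without the "m" flag, ^ and $ only match at the ends of the whole string -->
      <string-anchors><xsl:value-of select="matches('first&#10;second', '^second$')"/></string-anchors>
      <!-- with the "m" flag, ^ and $ also match at the ends of each line -->
      <line-anchors><xsl:value-of select="matches('first&#10;second', '^second$', 'm')"/></line-anchors>
    </regex-demo>
  </xsl:template>

</xsl:stylesheet>
```

The five elements contain true, X, XX, false, and true, in that order.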

(21) Matching & Replacing

matches() tests whether a string matches a given pattern
- an optional flags argument allows different processing options [http://www.w3.org/TR/xpath-functions/#flags]
```
matches("abracadabra", "bra")    ≡ true()
matches("abracadabra", "^a.*a$") ≡ true()
matches("abracadabra", "^bra")   ≡ false()
```

replace() selectively replaces parts of the input string
- supports the same flags as the matches() function
```
replace("abracadabra", "bra", "*")              ≡ "a*cada*"
replace("abracadabra", "a.*a", "*")             ≡ "*"
replace("abracadabra", "a.*?a", "*")            ≡ "*c*bra"
replace("abracadabra", "a(.)", "a$1$1")         ≡ "abbraccaddabbra"
replace("abracadabra", "^(.*?)b(.*)$", "$1c$2") ≡ "acracadabra"
```

(22) Tokenizing

tokenize() turns a string into a sequence of strings
- supports the transition from text structures to the XDM concept of sequences
- supports the same flags as the matches() and replace() functions
Tokenization is based on the concept of pattern-based structures
- input strings use some recognizable way of separating substrings
- a pattern matches the separators, and the substrings between them are returned as a sequence

  tokenize("just plain  text", "\s+") ≡ ( "just", "plain", "text" )
  tokenize("1,15,,24,50,", ",")       ≡ ( "1", "15", "", "24", "50", "" )
  tokenize("HTML <BR> tag<br />soup", "\s*<br\s*/?>\s*", "i") ≡ ( "HTML", "tag", "soup" )
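
Within XSLT, the sequence returned by tokenize() is typically processed with xsl:for-each. The sketch below assumes a hypothetical keywords element whose text content is a list separated by commas or semicolons, and it creates one keyword element per entry.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical input: <keywords>XSLT; XPath , XML</keywords> -->
  <xsl:template match="keywords">
    <keywords>
      <!-- the separator is a comma or semicolon with optional surrounding whitespace -->
      <xsl:for-each select="tokenize(normalize-space(.), '\s*[,;]\s*')">
        <keyword><xsl:value-of select="."/></keyword>
      </xsl:for-each>
    </keywords>
  </xsl:template>

</xsl:stylesheet>
```

For the sample input in the comment, this produces three keyword elements containing XSLT, XPath, and XML. An empty keywords element produces no keyword elements, because tokenize() returns an empty sequence for a zero-length input string.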

(23) Analyzing Strings

XPath functions work on a string and return strings or sequences
xsl:analyze-string executes XSLT code for parts of the string
- XSLT code can create elements and/or attributes based on string input
- transforming text into XML is often referred to as up-conversion
Two child elements contain the code for handling the parsing process
- xsl:matching-substring is evaluated for each matching part
- xsl:non-matching-substring is evaluated for each non-matching part
- each of these elements is optional, but at least one of them must be present
- if two adjacent matching substrings are found, xsl:matching-substring is evaluated twice

(24) Replacing Characters with Elements

Replace all newline characters in the abstract element with br elements

<xsl:analyze-string select="abstract" regex="\n">
  <xsl:matching-substring>
    <br/>
  </xsl:matching-substring>
  <xsl:non-matching-substring>
    <xsl:value-of select="."/>
  </xsl:non-matching-substring>
</xsl:analyze-string>

(25) Replacing Character Markup

Turn textual conventions into XML markup
- citations use […] to enclose the citation identifier

<xsl:analyze-string select="body" regex="\[(.*?)\]">
  <xsl:matching-substring>
    <cite><xsl:value-of select="regex-group(1)"/></cite>
  </xsl:matching-substring>
  <xsl:non-matching-substring>
    <xsl:value-of select="."/>
  </xsl:non-matching-substring>
</xsl:analyze-string>
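
The same pattern extends to several captured groups. The following sketch up-converts ISO dates in running text into a date element with separate attributes; the para and date element names are assumptions. Note that the regex attribute is an attribute value template, so the curly braces of the quantifiers have to be doubled.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical input: <para>Published on 2009-10-20 in Berkeley.</para> -->
  <xsl:template match="para">
    <para>
      <!-- {{4}} and {{2}} stand for the regex quantifiers {4} and {2} -->
      <xsl:analyze-string select="." regex="(\d{{4}})-(\d{{2}})-(\d{{2}})">
        <xsl:matching-substring>
          <!-- regex-group(n) returns the substring captured by the n-th parenthesized group -->
          <date year="{regex-group(1)}" month="{regex-group(2)}" day="{regex-group(3)}">
            <xsl:value-of select="."/>
          </date>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </para>
  </xsl:template>

</xsl:stylesheet>
```

Because select="." uses the string value of the para element, any child markup inside the paragraph is flattened to text; the sketch is therefore only suitable for text-only content.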

XML Transformations (XSLT) 2.0 – Part I

XML Foundations [./]
Fall 2009 — INFO 242 (CCN 42575)

Erik Wilde, UC Berkeley School of Information
2009-10-20

Contents

(2) Abstract

(3) XSLT 1.0 Restrictions

(4) XSLT 2.0 Improvements

Multiple Result Documents

Outline (Multiple Result Documents)

(6) One XML, Many HTML

(7) Generating HTML Pages

Creating Identifiers

Outline (Creating Identifiers)

(9) Navigable Hypertext

(10) Reuse Existing Identifiers

(11) Generate Content-Based Identifiers

(12) Generate Random Identifiers

(13) Using Existing Identifiers

Text Processing

Outline (Text Processing)

(15) Text Processing in XSLT 1.0

Accessing Text Documents

Outline (Accessing Text Documents)

(17) Non-XML in an XML World

(18) Comma-Separated Values (CSV)

Transforming Text

Outline (Transforming Text)

(20) Regular Expressions

(21) Matching & Replacing

(22) Tokenizing

(23) Analyzing Strings

(24) Replacing Characters with Elements

(25) Replacing Character Markup

Conclusions

Outline (Conclusions)

(27) A Better XSLT
