XML Foundations

INFO 242 (CCN 42736) – Fall 2007
School of Information, UC Berkeley

Instructor: Erik Wilde

Lecture & Lab: Tu & Th, 14.00–15.30, 110 South Hall

Description: The Extensible Markup Language (XML), with its ability to define formal structural and semantic definitions for metadata and information models, is the key enabling technology for information services and document-centric business models that use the Internet and its family of protocols. This course introduces XML syntax, transformations, schema languages, and the querying of XML databases. It balances conceptual topics with practical skills for designing, implementing, and handling conceptual models as XML schemas.

Date Subject Slides Resources
2007-08-28 Overview and Introduction: The Extensible Markup Language (XML) has been introduced in 1998 to enable content providers to publish their content on the Web in an application-specific format. HTML was considered as conveying not enough semantics, since its only purpose was (and is) the preparation of content for Web-based publishing. XML was the first step towards machine-readable data formats for the Web, a trend that since its invention has been taken to higher levels with the idea of the Semantic Web. XML appeared when the Web was in the steepest part of its success curve, and since then has taken over as the globally accepted format for the exchange of machine-readable structured data. Introduction (35 Slides) XML 1.0 Press Release
2007-08-30 Blogging in XML: XML in used in a wide variety of application scenarios, resulting in a wide variety of requirements. This lecture introduces the application example used in this course, which is the representation of blog data in XML. Blogs are a good example for XML, because of their mix of structured data (blog post metadata) and textual data (the actual blog post), the requirement to derive different views (such as weekly and monthly summaries) from the same set of data, and the requirement to make the data available in various output formats (such as HTML and RSS). Blogging in XML (24 Slides)
2007-09-04 XML Basics: The Extensible Markup Language (XML) defines a simple way for structuring data. The power and popularity of XML can be explained by its versatility, the platform-independence, the standards and technologies leveraging it, and the number of tools and products supporting it. Understanding XML itself is rather simple, it only depends on a very small set of other technologies. Unicode and URIs are the most important foundations of XML. XML itself specifies two different things: on the one hand the format for structured data, which are called XML documents, and on the other hand a constraint language for XML documents, which is called Document Type Definition (DTD). Basics (29 Slides) Spec
2007-09-06 Processing XML: XML is a format for structured data, but it does not prescribe any way of processing these structures. In practice, XML data has to processed by using XML-specific support in some programming environment. In this lecture, the most popular ways of processing XML data are discussed; the Document Object Model (DOM) as a tree-based data model, the Simple API for XML (SAX) as an event-based programming model, and XSL Transformations (XSLT) as a dedicated programming language for transforming XML. Processing XML (20 Slides) DOM · SAX
2007-09-11 Document Type Definition (DTD): The XML specification defines a format for structured data (XML documents) and a grammar-based constraint language for these (DTD). In SGML-based systems, DTDs were often very complex and feature-rich constructs, which controlled a lot of the processing of SGML documents. XML greatly simplified DTDs, and de-facto usage of DTDs today simplified them even more. In many systems today, DTDs are not used at all or generated from sample documents. In this lecture, it is argued that DTDs (or schemas, to be more general) should be taken seriously in any non-trivial XML application, because they are a representation of the underlying (and often underspecified) data model of the application. DTD (38 Slides) XML QuickRef
2007-09-13 The Good, the Bad, and the Ugly: While XML it rather easy to understand and use, it is also rather easy to use XML in ways which either produce ugly XML, or which may lead to problems in components further processing the XML. The topic of this lecture thus is to look at design guidelines for XML schemas, leading to good XML. Some of the simpler topics cover basic questions of how to map a data model to XML markup (e.g., when to use elements or attributes). The next question is how data should be represented in XML so that applications can process it efficiently. We also look at what part of the markup an application will actually have access to, and this is defined by the XML Information Set (Infoset), the specification underlying many XML technologies. Best Practices (33 Slides) Structuring Content with XML · On XML Language Design
2007-09-18 XML Namespaces: XML is successful because it can be used in many different scenarios, and because it is easy to define a schema (such as a DTD) for new scenarios, producing a tailored XML data model for this scenario. This means that names in XML documents must be interpreted as belonging to a certain schema. As long as a document uses names from only one schema, this can be done rather easily. However, in many scenarios today documents combine names from different schemas, and XML Namespaces provide a mechanism how the names in an XML document can be associated with a namespace. Namespaces (26 Slides) XML Namespaces FAQ (Part I) · Spec
2007-09-20 XML Path Language (XPath): XML structures data into a rather small number of different constructs, most notably elements and attributes. The XML Path Language (XPath) defines a way how to select parts of XML documents, so that they can be used for further processing. XPath's primary use in in XSL Transformations (XSLT), but other XML technologies use it as well, e.g. XSDL. XPath is a very compact language with a syntax that resembles path expressions well-known from file systems. These path expressions, however, are generalized and therefore much more powerful than the rather simple path expressions in file systems. Because of its use in different XML technologies, XPath is one of the most important XML core technologies. XPath (35 Slides) XPath Chapter · XPath QuickRef
2007-09-25 XML Transformations (XSLT) – Part I: Because XML can be used to represent any vocabulary (often defined by some schema), the question is how these different vocabularies can be processed and maybe transformed into something else. This something else may be another XML vocabulary (a common requirement in B2B scenarios), or it may be HTML (a common scenario for Web publishing). Using XSL Transformations (XSLT), mapping tasks can be implemented easily. XSLT leverages XPath's expressive power in a rather simple programming language, the programs are often called stylesheets. For easy tasks, XSLT mappings can be specified without much real programming going on, by simply specifying how components of the source markup are mapped to components of the target markup. XSLT 1 (22 Slides)
2007-09-27 XML Transformations (XSLT) – Part II: XSLT processes documents by matching nodes in the document tree to templates, which then are executed to process these nodes. This process of matching and executing templates is the core of XSLT's processing model. XSLT has built-in templates which complement the user-supplied templates, so that the XSLT processor always finds a template to execute. Templates can conflict, and it is then necessary to resolve this conflict by finding the best match of all matching templates. This conflict resolution process also is a very important component of the XSLT processing model. XSLT 2 (26 Slides)
2007-10-02 XML Transformations (XSLT) – Part III: XSLT's template matching mechanism lets the XSLT processor find the best match to process a selected node. XSLT also supports a more traditional way of using templates, where they are called in a way very similar for function calls in most programming languages. Another interesting area of XSLT are variables and parameters, which are used for storing or passing values within XSLT code. One special property of XSLT variables is that they cannot be changed, which is a result of the functional design of the language. XSLT 3 (24 Slides) XSLT Parameters
2007-10-04 XML Transformations (XSLT) – Part IV: Advanced XSLT processing includes better control of the input and output documents, which can be finely controlled in terms of how whitespace is treated. Another interesting feature of XSLT are keys, which allow shorthand notations for frequently used access paths to nodes, and provide XSLT processors with more information for performance optimizations. Instructions for creating all possible kinds of nodes in the output tree make it possible to write code which generates element or attribute names based on runtime evaluations. XSLT 4 (26 Slides)
2007-10-09 XML Path Language (XPath) 2.0: The XML Path Language (XPath) is one of the most useful and frequently used languages in the are of XML technologies. In its version 1.0, it is used in technologies such as XSLT, XSDL, DOM, and XML Tools. With XPath 2.0, the language has been greatly extended, the new version of XPath is the foundation for XSLT 2.0 and XQuery. XPath 2.0 provides support for regular expression matching, typed expressions, and contains language constructs for conditional and repeated evaluation. XPath 2.0 (35 Slides) Spec
2007-10-11 XQuery 1.0 and XPath 2.0 Data Model (XDM): While XPath 2.0 syntactically is an extension of XPath 1.0, the underlying data model has changed quite radically. Instead of XPath 1.0's simple concept of four datatypes (node set, number, string, boolean), the XQuery 1.0 and XPath 2.0 Data Model (XDM) is based on sequences and allows much more sophisticated ways of data representation and manipulation. Furthermore, XDM includes the datatypes defined by XSDL, which results in an complex and powerful collection of built-in datatypes and operations on these datatypes. XDM (28 Slides) Spec
2007-10-16 XML Transformations (XSLT) 2.0 – Part I: While XML Transformations (XSLT) 1.0 has become a successful programming language widely used for transforming XML documents, its limitations sometimes make it difficult to use XSLT in a good way. An important reason for many of the limitations is the fact that XSLT 1.0 has been designed as a client-side language. Building on XSLT 1.0 and XPath 2.0, XML Transformations (XSLT) 2.0 improves the language in a variety of ways. XSLT 2.0 1 (27 Slides) Spec
2007-10-18 XML Transformations (XSLT) 2.0 – Part II: Many of the new features of XSLT 2.0 have their roots in XPath 2.0 and the underlying new data model of sequences. But some features of XSLT 2.0 really are part of the language itself, such as support for user-defined functions, and the ability to group items and then iterate over these groups. In addition, XSLT now can be used as a typed programming language, which consumes and produces typed trees instead of just well-formed XML trees. XSLT 2.0 2 (17 Slides) Reevaluating XSLT 2.0
2007-10-23 XSDL – Part I: The XML Schema Definition Language (XSDL) is the most popular schema language for XML today. It has been introduced to overcome some of the commonly observed limitations of DTDs, most notably the lack of typing. Simple Types describe content which is not structured by XML markup, which means it describes attribute values and element content. Simple types can be defined by deriving new types from existing types by using type restriction. XSDL 1 (27 Slides) XSDL QuickRef · XML Schema
2007-10-25 XSDL – Part II: XSDL Complex Types describe element content if this content is using attributes and/or element content other than only character data. Thus, complex types are used to define the allowed markup structures for a class of documents. Using XSDL's type concepts, it is easier to represent model-level information in a schema, because type hierarchies can represent model-level specializations. XSDL 2 (27 Slides)
2007-10-30 XSDL – Part III: XSDL allows greater flexibility in defining constraints on intra-document references than the ID/IDREF construct of DTDs. XSDL's Identity Constraints are scoped, typed, and can be used for elements or attributes. They are more powerful that the DTD's limited ID/IDREF mechanism, but still lack sufficient generality to support a really wide set of model constraints to be expressed. XSDL 3 (24 Slides) XSDL Identity Constraints
2007-11-01 XSDL – Part IV: XSDL complex types can be derived by restriction or extension. Complex type restriction defines the restricted type to be a more restricted version of the base type. Complex type extension make it possible to extend the base type by either adding attributes or contents (only by appending new content to the content model). Complex type derivation allows XSDL to express type hierarchies of complex types, which can be aligned with more or less specialized code for processing instances of these types. XSDL 4 (16 Slides)
2007-11-06 From Model to Markup: While XML is very useful for representing and manipulating structured data, the question remains where these structures come from. They are usually some kind of encoding for a conceptual model, but there is no established and universally accepted way of how to connect the modeling world with XML markup. Some of the challenges and approaches to XML and modeling will be presented in this lecture. The goal of this lecture is to raise awareness for the current gap between models and markup, and for practical approaches how to bridge that gap. Modeling (42 Slides)
2007-11-08 Alternative Schema Languages – Schematron: XSDL is only one representative from a class of languages which are all designed for the purpose of testing whether some XML document satisfies a set of constraints. This test could of course also be conducted programmatically, but this is not portable and not easily maintainable. Schema languages thus often use a declarative approach to specifying how to conduct validation. A very simple yet very powerful language for this is Schematron, which uses the expressive power of XPath for testing whether a document satisfies a set of conditions. Schematron is rule-based in contrast to the more traditional grammar-based schema languages and complements these very well. Schema Languages (39 Slides) The Design of RELAX NG · Schematron
2007-11-13 XML Query (XQuery) – Part I: The XML Query (XQuery) language has been designed to query collections of XML documents. It is thus different from XSLT, which primarily transforms one document at a time. However, the core of both languages is XPath 2.0, which means that learning XQuery (and XSLT 2.0) is not very hard when starting with a solid knowledge of XPath 2.0. XQuery's main concept is an expression language which supports iteration and binding of variables to intermediate results. The final result of an XQuery is a tree, which can be serialized in various serialization formats. XQuery 1 (27 Slides) Spec
2007-11-15 XML Query (XQuery) – Part II: XQuery has been built on top of XPath 2.0, which means it uses the same foundation as XSLT 2.0. Both languages have a large overlap, and according to personal preferences and the XML task, one language may be preferred over the other. Features such as user-defined functions and schema-awareness bring XQuery even closer to XSLT 2.0, making the decision to choose one over the other mostly a question of personal preference. XQuery 2 (20 Slides) XQuery/XSLT Comparison
2007-11-20 XML Databases: XML Databases are specialized databases for handling XML data. As their query language, they will often use XQuery, but they need additional technologies for updating and storing data. XQuery currently is a read-only language, so update facilities must be provided as an addition to XQuery querying capabilities. One of the big advantages of databases vs. file systems are optimized storage (and thus access) structures, and in the case of XML databases this means storing XML documents other than as text files. XDBMS (30 Slides) eXist
2007-12-04 XML and Databases: While XML databases are a good solution for managing XML content, frequently it is necessary to uses non-XML databases for managing XML content. In most cases, these databases will be relational databases. There a two major approaches of how to manage XML content in a relational database. The first approach is to define a mapping between XML and relational structures and work with this mapping. The second approach is to use the XML-specific functionality, which is increasingly provided by relational databases, turning them into XML-aware databases. XML & DBMS (32 Slides) FAQ
2007-12-06 XML Trends & Developments: XML is a very basic technology for representing trees using a standardized markup-based syntax. An increasing number of technologies are building on this foundation, creating an expanding field of XML-based technologies for interoperability in many different fields. Application-specific XML-based data formats are used in many different settings, and the best data format for a given scenario depends on the existing formats in this area and the exact requirements. More interestingly, generic XML technologies which can be applied in many different settings make it easier for developers and system integrators to achieve their goal of making system interoperate. XML Trends (25 Slides) W3C XML Activity Statement
Show Abstracts
Hide Abstracts
Creative Commons License Please send comments to dret@berkeley.edu
Last modification on Thursday, 20-Nov-2008 02:07:54 CET
valid CSS! valid XHTML 1.0!