Data Model Perspectives for XML Schema

Felix Michel, Erik Wilde


The family of upcoming XML technologies, consisting of XPath 2.0, XSLT 2.0, and XQuery, no longer operates only on the Infoset, but also utilizes schema information. Today, this schema information is added to the Infoset during schema validation and is commonly referred to as PSVI contributions (PSVI for "Post-Schema-Validation Infoset"). Utilizing schema information is promising, because XML Schema allows relationships between structures to be described in an expressive, semantically relevant way, e.g. through type derivation and substitution groups. This structural information can become valuable metadata when processing instances that comply with the respective schema. However, only a small fraction of this schema information is accessible with the aforementioned technologies. There are various reasons for this. First, some schema information, such as where wildcards can occur, is not exposed at all, and other components (e.g. types) are represented only by QNames, with no possibility of navigating the schema information further. Second, the PSVI specification remains vague with respect to the data model. And finally, the present data model of XML Schema is not appropriate for some application contexts. The existence of differing data models for XML Schema (e.g. in programming APIs for XML Schema) is evidence that the abstract data model as defined in the recommendation does not rule out the need for other data model perspectives. In fact, the abstract data model and its incarnations (namely the normative XML syntax) may be good for defining schemas, but they prove to be less appropriate for exploiting the structural information. Features that are convenient for definition (such as named groups and nested model groups) turn out to be problematic for retrieval and navigation, the most important ways of using the structural information.
We propose an alternative data model perspective that represents the schema information in a way that better meets the needs of certain classes of applications. These applications have in common read-only access to schema information, an instance-driven perspective, the need for schema inspection at runtime, and possibly only a local scope. Our data model uses what we call "occurrences" instead of the "particles" in the normative abstract data model, and it expands what we (deliberately) consider to be notational shorthands (like occurrence constraints and named groups). Furthermore, we index all occurrences (even of the same element), as is done in "marked expressions" in regular language theory. The structural information is no longer captured by model groups, but by a set of potential next occurrences. This is based on the idea of Brzozowski derivatives and is again inspired by the anticipated needs of instance-oriented applications. We present a prototype implementation that is based purely on standard technologies. It is implemented as an XSLT 2.0 function library that reads schemas in the normative XML syntax, constructs the data model from this information, and provides various functions for accessing, navigating, and exploiting the schema information. We show that such functionality is highly beneficial, making applications more powerful, more resilient, and easier to develop.
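The idea of indexed occurrences and "potential next occurrences" can be illustrated with a small sketch. The following Python code is not the authors' XSLT 2.0 library; it is a minimal illustration, assuming a content model has already been expanded into sequences, choices, and repetitions. All names (`occ`, `seq`, `alt`, `star`, `first`, `deriv`, `next_occurrences`) are hypothetical. It takes the Brzozowski derivative of a marked expression and reads off the occurrences that may come next.

```python
# Marked regular expressions as plain tuples (hypothetical representation).
EPS = ("eps",)    # matches the empty sequence
NULL = ("null",)  # matches nothing

def occ(name, index): return ("occ", name, index)  # indexed occurrence
def seq(a, b): return ("seq", a, b)
def alt(a, b): return ("alt", a, b)
def star(a): return ("star", a)

def nullable(e):
    """True if the expression accepts the empty sequence."""
    tag = e[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("null", "occ"):
        return False
    if tag == "seq":
        return nullable(e[1]) and nullable(e[2])
    return nullable(e[1]) or nullable(e[2])  # alt

def first(e):
    """Set of (name, index) occurrences that may appear next."""
    tag = e[0]
    if tag in ("eps", "null"):
        return set()
    if tag == "occ":
        return {(e[1], e[2])}
    if tag == "star":
        return first(e[1])
    if tag == "alt":
        return first(e[1]) | first(e[2])
    result = first(e[1])  # seq
    if nullable(e[1]):
        result = result | first(e[2])
    return result

def deriv(e, o):
    """Brzozowski derivative of e with respect to occurrence o."""
    tag = e[0]
    if tag in ("eps", "null"):
        return NULL
    if tag == "occ":
        return EPS if (e[1], e[2]) == o else NULL
    if tag == "alt":
        return alt(deriv(e[1], o), deriv(e[2], o))
    if tag == "star":
        return seq(deriv(e[1], o), e)
    left = seq(deriv(e[1], o), e[2])  # seq
    return alt(left, deriv(e[2], o)) if nullable(e[1]) else left

def next_occurrences(model, consumed):
    """Occurrences allowed after consuming the given occurrences in order."""
    for o in consumed:
        model = deriv(model, o)
    return first(model), nullable(model)

# Content model (a, (b | c)*, a), with every occurrence indexed.
MODEL = seq(occ("a", 1), seq(star(alt(occ("b", 2), occ("c", 3))), occ("a", 4)))
```

For example, `next_occurrences(MODEL, [("a", 1)])` yields the occurrences `b2`, `c3`, and `a4`, and reports that the content is not yet complete. Because the two `a` elements carry different indices, an application can tell which position in the content model an instance element corresponds to.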


