The Good, the Bad, and the Ugly

XML Foundations [./]
Fall 2010 — INFO 242 (CCN 42593)

Erik Wilde, UC Berkeley School of Information
2010-09-16

This work is licensed under a CC Attribution 3.0 Unported License [http://creativecommons.org/licenses/by/3.0/]


E. Wilde: The Good, the Bad, and the Ugly

(2) Abstract

While XML is rather easy to understand and use, it is also rather easy to use XML in ways that either produce ugly XML or cause problems for components that process the XML further. This lecture therefore looks at design guidelines for XML schemas that lead to good XML. Some of the simpler topics cover basic questions of how to map a data model to XML markup (e.g., when to use elements or attributes). The next question is how data should be represented in XML so that applications can process it efficiently. We also look at which parts of the markup an application actually has access to; this is defined by the XML Information Set (Infoset), the specification underlying many XML technologies.




(3) XML Best Practices



XML Best Practices

Outline (XML Best Practices)

  1. XML Best Practices [12]
    1. XML Documents [2]
    2. XML DTDs [6]
    3. General XML Issues [3]
  2. Bad XML [2]
  3. Ugly XML [3]
  4. XML Information Set (XML Infoset) [3]
  5. Conclusions [1]

(5) Markup and Schemas



XML Documents


(7) Generating XML

  • Character encoding
    • use one of XML's standard encodings (UTF-8 or UTF-16)
    • if you are using mostly Latin characters, UTF-8 is much more compact than UTF-16
    • any other character encoding may cause interoperability issues
  • Pretty-printing (adding line feeds and indentation)
    • pretty-printed XML is easier to read for humans
    • pretty-printed XML contains unnecessary whitespace
    • pretty-printing is good for experiments and prototypes
    • pretty-printing should be switched off for production systems
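
The two points above can be sketched in Python with the standard library only. The sample document and its element names are made up for illustration and do not come from the lecture; the sketch simply compares the size of compact and pretty-printed output, and of UTF-8 and UTF-16 output.

```python
import xml.dom.minidom

# Hypothetical sample document (not from the lecture), written compactly.
compact = "<people><person><name>Zoë</name></person></people>"

dom = xml.dom.minidom.parseString(compact.encode("utf-8"))

# Serialize without any added whitespace, as for a production system.
production = dom.documentElement.toxml()

# Serialize with line feeds and indentation, as for reading and debugging.
pretty = dom.documentElement.toprettyxml(indent="  ")

# The same characters in the two encodings recommended above.
utf8_size = len(production.encode("utf-8"))
utf16_size = len(production.encode("utf-16"))

print(len(production), len(pretty))  # the pretty-printed text is longer
print(utf8_size, utf16_size)         # UTF-8 is smaller for mostly Latin text
```

The extra characters in the pretty-printed version are whitespace only, which is why switching it off costs nothing but readability.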



(8) XML Views

  • Other people may use different tools
    • XML is a character-based format, so every character counts
    • other people may choose different technologies
    • even your XML editor may choose to see things differently
  • Many XML technologies use abstractions
    • useful for concentrating on the tree view
    • no full control of markup usage (automatic serialization)
    • think about working with a tree rather than working with a text file
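
A small Python sketch of the difference between the text view and the tree view, assuming the standard library's ElementTree as the abstraction. The two input strings and the helper `same_element` are hypothetical; they only illustrate that documents differing character by character can yield the same tree.

```python
import xml.etree.ElementTree as ET

# Two hypothetical serializations that differ character by character:
# quote style, attribute order, and empty-element syntax.
text_a = "<phone cc='1' area=\"510\"/>"
text_b = '<phone area="510" cc="1"></phone>'

tree_a = ET.fromstring(text_a)
tree_b = ET.fromstring(text_b)

def same_element(x, y):
    """Compare two elements on tag, attributes, text, and children."""
    return (
        x.tag == y.tag
        and x.attrib == y.attrib
        and (x.text or "") == (y.text or "")
        and len(x) == len(y)
        and all(same_element(cx, cy) for cx, cy in zip(x, y))
    )

print(text_a == text_b)              # False: the character strings differ
print(same_element(tree_a, tree_b))  # True: the parsed trees are the same
```

Code written against the tree keeps working when another tool re-serializes the document with different markup choices.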


XML DTDs


(10) From Model to Markup

  • There should be a conceptual model of the data
    • formal conceptual models for XML are an active field of research
    • informal models may use any notation (prose, graphics, combinations of these)
  • Model design should omit questions of markup design
    • element/attribute decisions are not a model question
    • hierarchy/reference decisions are not a model question
    • identifying the relevant entities and their relationships is a good idea
  • Document engineering never invented modeling tools
    • for document modelers, the markup is the model
    • there are no established notations for modeling documents
    • document-type parts (e.g., mixed content) are hard to include in models



(11) From Graphs to Trees

  • In the model, n:m relationships may appear
    • in an address database, an address should be reusable
    • in a résumé, an organization's information should be reusable
  • XML documents are trees
    • all non-tree structures must be represented by tree structures
    • in most cases, this will be done by introducing references



(12) From Markup to Model

  • Start with a sample instance
    1. start with a sample instance
    2. generate a schema for the instance with some tool
    3. open up the schema where necessary
    4. try creating more example instances as different as possible/required
    5. write code for manipulating your test set of instances
  • Restarting may be hard, but should be done
    • view the initial design as a test bed, not as the first version
    • after you have learned some lessons, throw everything away
    • restart by designing everything from scratch
    • content may be salvaged by writing small XSLT programs



(13) Top-Down or Bottom-Up?

  • Top-Down looks at markup as a serialization of a model
  • Bottom-Up builds markup incrementally from existing modules
  • Both strategies have strengths and shortcomings
    • top-down tends to result in markup which looks generated
    • bottom-up tends to result in markup which is less consistent
  • Consistency is an important consideration
    • if you dislike attributes, avoid them wherever possible
    • if you like attributes, use them wherever possible
    • don't mix these two styles of markup design



(14) Reuse is Good

  • Elements can be reused in different contexts
    • elements then appear in the content model of more than one element
    • an address may be used for employee as well as for customer
  • Content can be reused in different contexts
    • (parts of) a content model may be useful in different contexts
    • this only reuses an element's content, but not its name
  • Attributes can be reused in different contexts
    • technically, attributes are element-specific: the same attribute name on different elements denotes unrelated attributes
    • when reusing attribute names, they should represent the same concept
<company>
 <people>
  <customer added="2006-08-30">
   <name>Erik Wilde</name>
   <address> ... (containing elements) ... </address>
  </customer>
  <employee id="e65783" added="2006-08-28">
   <address> ... (containing elements) ... </address>
  </employee>
  <temporary_employee id="e65784" added="2006-09-02">
   <address> ... (containing elements) ... </address>
  </temporary_employee>
 </people>
</company>
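
One practical benefit of this kind of reuse is that processing code can be reused as well. The following Python sketch assumes a shortened version of the example above; the `<city>` element and its values are hypothetical placeholders for the elided address content, and `format_address` is an illustrative helper.

```python
import xml.etree.ElementTree as ET

# Shortened version of the example above; <city> is a hypothetical
# placeholder for the elided address content.
doc = ET.fromstring("""
<company>
 <people>
  <customer added="2006-08-30">
   <name>Erik Wilde</name>
   <address><city>City A</city></address>
  </customer>
  <employee id="e65783" added="2006-08-28">
   <address><city>City B</city></address>
  </employee>
 </people>
</company>
""")

def format_address(address):
    """One routine handles <address> wherever it appears."""
    return address.findtext("city")

# iter() finds the element by name, regardless of its parent.
cities = [format_address(a) for a in doc.iter("address")]

# The reused 'added' attribute represents the same concept on every
# element, so it can be read without looking at the element name.
added = [person.get("added") for person in doc.find("people")]

print(cities)
print(added)
```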



(15) Reuse is Hard (in DTDs)

  • Element reuse simply lists the element in more than one content model
  • Content reuse requires parameter entities
  • Attribute reuse requires parameter entities
  • Nested parameter entities for multi-level reuse
<!ENTITY % added "added CDATA #REQUIRED" >
<!ENTITY % employee_content "address, department?" >
<!ENTITY % employee_attributes "id CDATA #REQUIRED %added;" >
<!ELEMENT company (people) >
<!ELEMENT people (customer | employee | temporary_employee)* >
<!ELEMENT address (#PCDATA) >
<!ELEMENT customer (name, address) >
<!ATTLIST customer %added; >
<!ELEMENT name (#PCDATA)>
<!ELEMENT employee (%employee_content;) >
<!ATTLIST employee %employee_attributes; >
<!ELEMENT temporary_employee (%employee_content;) >
<!ATTLIST temporary_employee %employee_attributes; >
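
To see what the nested parameter entities in the DTD above amount to, the following Python sketch expands them by plain text substitution. This is a deliberately naive illustration, not a DTD parser: real XML processors apply additional rules when including parameter entities, and the `expand` helper is hypothetical.

```python
import re

# Parameter entity declarations copied from the DTD above.
entities = {
    "added": "added CDATA #REQUIRED",
    "employee_content": "address, department?",
    "employee_attributes": "id CDATA #REQUIRED %added;",
}

def expand(text):
    """Replace %name; references until none are left (no cycle check)."""
    pattern = re.compile(r"%([A-Za-z_][\w.-]*);")
    while pattern.search(text):
        text = pattern.sub(lambda m: entities[m.group(1)], text)
    return text

attlist = expand("<!ATTLIST employee %employee_attributes; >")
element = expand("<!ELEMENT employee (%employee_content;) >")

print(attlist)
print(element)
```

The expanded `ATTLIST` declares both `id` and `added`, which shows how `%employee_attributes;` reuses `%added;` one level down.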


General XML Issues


(17) Element vs. Attribute

  • Elements and attributes are containers
    • both contain character content
  • Elements may carry attributes and may contain other elements
    • for nested structures, elements must be chosen
    • if the content needs to be annotated with an attribute, an element must be chosen
    • if the item should be repeatable, an element must be chosen
  • Attributes use less markup and have types
    • if the content is (unstructured) metadata, an attribute may be a good choice
    • in DTDs, special types (ID/IDREF and enumerations) are only available for attributes
    • if simple markup is an issue, attributes may be preferable
  • Be consistent in your markup design style!
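
The repeatability rule can be checked directly with a parser. In this Python sketch the `<person>` markup is hypothetical; it shows that repeated items work as elements, while repeating an attribute name on one element makes the document not well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a repeated item must be an element ...
as_elements = ET.fromstring(
    "<person><phone>+1-510-6432253</phone><phone>+1-510-6425814</phone></person>"
)
phones = [p.text for p in as_elements.findall("phone")]

# ... because an attribute name may appear only once per element.
try:
    ET.fromstring('<person phone="+1-510-6432253" phone="+1-510-6425814"/>')
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True

print(phones)
print(duplicate_rejected)
```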



(18) Hierarchy vs. Reference

  • Hierarchies are only possible with 1:n relationships
    • for n:m relationships, references are the only possible representation
  • Containment should be represented as hierarchy
    • containment limits the lifetime of the contained part to that of the container
<addressbook>
 <person>
  <name>Erik Wilde</name>
  <address> ... (containing elements) ... </address>
 </person>
 <person>
  <name>Katrina Lindholm</name>
  <address> ... (containing elements) ... </address>
 </person>
</addressbook>

<addressbook>
 <person address="a1">
  <name>Erik Wilde</name>
 </person>
 <person address="a2">
  <name>Katrina Lindholm</name>
 </person>
 <address id="a1"> ... (containing elements) ... </address>
 <address id="a2"> ... (containing elements) ... </address>
</addressbook>
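
With the reference-based variant, the application has to follow the references itself. A minimal Python sketch, assuming the second address book above; the `<city>` element and its values are hypothetical placeholders for the elided address content, and `address_of` is an illustrative helper.

```python
import xml.etree.ElementTree as ET

# Reference-based address book from above; <city> is a hypothetical
# placeholder for the elided address content.
book = ET.fromstring("""
<addressbook>
 <person address="a1"><name>Erik Wilde</name></person>
 <person address="a2"><name>Katrina Lindholm</name></person>
 <address id="a1"><city>City A</city></address>
 <address id="a2"><city>City B</city></address>
</addressbook>
""")

# Build an index from id to <address>; ElementTree does not do this for us.
by_id = {a.get("id"): a for a in book.findall("address")}

def address_of(person):
    """Follow the reference; returns None if it dangles."""
    return by_id.get(person.get("address"))

cities = {
    p.findtext("name"): address_of(p).findtext("city")
    for p in book.findall("person")
}

print(cities)
```

In the hierarchical variant the same lookup is a plain child access, which is the price paid for making addresses shareable.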



(19) Granularity

  • XML structures should identify the relevant information
    • what exactly does relevant mean?
    • very high granularity makes data acquisition hard
    • very high granularity makes data processing easy
  • Granularity is a general problem of data modeling
    • XML is simply a syntax for representing structured data
      <phone>+1-510-6432253</phone>
      <phone cc="1" area="510" local="6432253"/>
  • Overmodeling can make it hard to adapt to changes
    • treating countries as an enumeration requires a dynamic encyclopedia
    • treating a postal address as highly granular tends to break for foreign postal systems
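
The trade-off for the two phone representations above can be sketched in Python. The string-splitting rule for the coarse variant is an assumption made for this example and only works for numbers written in the "+cc-area-local" pattern.

```python
import xml.etree.ElementTree as ET

coarse = ET.fromstring("<phone>+1-510-6432253</phone>")
fine = ET.fromstring('<phone cc="1" area="510" local="6432253"/>')

# Fine-grained markup: the area code is directly addressable.
area_fine = fine.get("area")

# Coarse markup: the application has to parse the string itself, and this
# parsing rule (an assumption) only fits the "+cc-area-local" pattern.
cc, area_coarse, local = coarse.text.lstrip("+").split("-")

print(area_fine, area_coarse)
```

Processing is easier with the fine-grained form, but whoever enters the data now has to split every number correctly, including numbers from countries without area codes.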


Bad XML


(21) Consistent Markup




(22) Simple Markup



Ugly XML


(24) Redundant Data




(25) Redundancy in the Schema




(26) Generically Generated Markup

<objects>
 <object type="person" id="o3444">
  <attribute name="name">Erik Wilde</attribute>
  <attribute name="email">dret@ischool.berkeley.edu</attribute>
  <relation type="office" object="o3445"/>
 </object>
 <object type="address" id="o3445">
  <attribute name="room">314 South Hall</attribute>
  <attribute name="phone" type="voice">+1-510-6432253</attribute>
  <attribute name="phone" type="fax">+1-510-6425814</attribute>
  <relation type="worker" object="o3444"/>
 </object>
</objects>
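
One way to see the cost of such generic markup is to look at the code that reads it. The Python sketch below uses a shortened version of the example above and compares it with a hypothetical domain-specific equivalent; the `<person>`, `<name>`, and `<email>` elements of that equivalent are assumptions for illustration and not part of the lecture.

```python
import xml.etree.ElementTree as ET

# Generic markup, shortened from the example above.
generic = ET.fromstring("""
<objects>
 <object type="person" id="o3444">
  <attribute name="name">Erik Wilde</attribute>
  <attribute name="email">dret@ischool.berkeley.edu</attribute>
 </object>
</objects>
""")

# A hypothetical domain-specific equivalent (not from the lecture).
specific = ET.fromstring("""
<person id="o3444">
 <name>Erik Wilde</name>
 <email>dret@ischool.berkeley.edu</email>
</person>
""")

# Generic: select by attribute values, since element names say nothing.
email_generic = generic.findtext(
    "object[@type='person']/attribute[@name='email']"
)

# Specific: the element names carry the meaning.
email_specific = specific.findtext("email")

print(email_generic == email_specific)
```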


XML Information Set (XML Infoset)


(28) What is the Content of an XML Document?




(29) Infoset Example

[Figure: infoset-example.png]


(30) What is Not in the Infoset



Conclusions


(32) XML and Modeling


