The Good, the Bad, and the Ugly

XML Foundations (INFO 242)

Erik Wilde, UC Berkeley School of Information
2007-09-13
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Abstract

While XML is rather easy to understand and use, it is also rather easy to use it in ways that either produce ugly XML or cause problems for components that process the XML further. This lecture therefore looks at design guidelines for XML schemas that lead to good XML. Some of the simpler topics cover basic questions of how to map a data model to XML markup (e.g., when to use elements or attributes). The next question is how data should be represented in XML so that applications can process it efficiently. We also look at what part of the markup an application actually has access to; this is defined by the XML Information Set (Infoset), the specification underlying many XML technologies.

XML Best Practices

Outline (XML Best Practices)

  1. XML Best Practices [12]
    1. XML Documents [2]
    2. XML DTDs [6]
    3. General XML Issues [3]
  2. Bad XML [2]
  3. Ugly XML [3]
  4. XML Information Set (XML Infoset) [3]
  5. Conclusions [2]

Markup and Schemas

XML Documents

Generating XML

XML Views

XML DTDs

From Model to Markup

From Graphs to Trees

From Markup to Model

Top-Down or Bottom-Up?

Reuse is Good

<company>
 <people>
  <customer added="2006-08-30">
   <name>Erik Wilde</name>
   <address> ... (containing elements) ... </address>
  </customer>
  <employee id="e65783" added="2006-08-28">
   <address> ... (containing elements) ... </address>
  </employee>
  <temporary_employee id="e65784" added="2006-09-02">
   <address> ... (containing elements) ... </address>
  </temporary_employee>
 </people>
</company>
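
Because customers, employees, and temporary employees all reuse the same address element, a consumer can handle every address with a single piece of code. The following is a minimal Python sketch of that idea, assuming the standard library's xml.etree.ElementTree; the address text values and the function name collect_addresses are invented for illustration and are not part of the lecture.

```python
import xml.etree.ElementTree as ET

# Sample instance modeled on the example above. The address text values
# are placeholders invented for this sketch.
DOC = """
<company>
 <people>
  <customer added="2006-08-30">
   <name>Erik Wilde</name>
   <address>address of customer</address>
  </customer>
  <employee id="e65783" added="2006-08-28">
   <address>address of employee</address>
  </employee>
  <temporary_employee id="e65784" added="2006-09-02">
   <address>address of temporary employee</address>
  </temporary_employee>
 </people>
</company>
"""


def collect_addresses(xml_text):
    """Return (parent element name, address text) for every person."""
    root = ET.fromstring(xml_text)
    result = []
    for person in root.find("people"):
        # The same lookup works for every kind of person because the
        # address element is reused under the same name.
        address = person.find("address")
        if address is not None:
            result.append((person.tag, address.text))
    return result


if __name__ == "__main__":
    for kind, text in collect_addresses(DOC):
        print(kind, "->", text)
```

If each kind of person used its own address markup, the loop body would need one branch per kind.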

Reuse is Hard (in DTDs)

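<!-- Parameter entity references inside markup declarations are not allowed in the internal DTD subset, so this DTD has to be an external one. -->
<!-- Reusable fragments -->
<!ENTITY % added "added CDATA #REQUIRED" >
<!ENTITY % employee_content "address, department?" >
<!ENTITY % employee_attributes "id CDATA #REQUIRED %added;" >
<!-- Document structure -->
<!ELEMENT company (people) >
<!ELEMENT people (customer | employee | temporary_employee)* >
<!-- address is simplified to text here; department is referenced but not declared in this excerpt -->
<!ELEMENT address (#PCDATA) >
<!ELEMENT customer (name, address) >
<!ATTLIST customer %added; >
<!ELEMENT name (#PCDATA)>
<!-- Both employee types share content model and attributes through the entities above -->
<!ELEMENT employee (%employee_content;) >
<!ATTLIST employee %employee_attributes; >
<!ELEMENT temporary_employee (%employee_content;) >
<!ATTLIST temporary_employee %employee_attributes; >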

General XML Issues

Element vs. Attribute
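
One mechanical difference often settles the choice: an attribute name may appear only once per element and its value is a plain string, whereas child elements can repeat and nest. The sketch below illustrates this with Python's standard xml.etree.ElementTree; the person and phone names and the phone values are made up for the example.

```python
import xml.etree.ElementTree as ET

# Repeating a value works with child elements ...
AS_ELEMENTS = "<person><phone>111</phone><phone>222</phone></person>"
# ... but repeating an attribute name is a well-formedness error.
AS_ATTRIBUTES = '<person phone="111" phone="222"/>'


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def phones(xml_text):
    """Return the text of all phone child elements in document order."""
    return [p.text for p in ET.fromstring(xml_text).findall("phone")]


if __name__ == "__main__":
    print(is_well_formed(AS_ELEMENTS), phones(AS_ELEMENTS))
    print(is_well_formed(AS_ATTRIBUTES))
```

Data that may occur more than once, or that may need internal structure later, is therefore usually safer as an element.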

Hierarchy vs. Reference

<!-- Hierarchy: each address is nested inside the person it belongs to -->
<addressbook>
 <person>
  <name>Erik Wilde</name>
  <address> ... (containing elements) ... </address>
 </person>
 <person>
  <name>Katrina Lindholm</name>
  <address> ... (containing elements) ... </address>
 </person>
</addressbook>

<!-- Reference: each person points to a separately stored address by identifier -->
<addressbook>
 <person address="a1">
  <name>Erik Wilde</name>
 </person>
 <person address="a2">
  <name>Katrina Lindholm</name>
 </person>
 <address id="a1"> ... (containing elements) ... </address>
 <address id="a2"> ... (containing elements) ... </address>
</addressbook>
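
The two designs put different amounts of work on the consumer. As a rough sketch (Python, standard xml.etree.ElementTree), the nested form needs only a child lookup, while the reference form needs an identifier index and a decision about dangling references. The address text values and both function names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Address text values are placeholders invented for this sketch.
NESTED = """
<addressbook>
 <person><name>Erik Wilde</name><address>address one</address></person>
 <person><name>Katrina Lindholm</name><address>address two</address></person>
</addressbook>
"""

REFERENCED = """
<addressbook>
 <person address="a1"><name>Erik Wilde</name></person>
 <person address="a2"><name>Katrina Lindholm</name></person>
 <address id="a1">address one</address>
 <address id="a2">address two</address>
</addressbook>
"""


def addresses_nested(xml_text):
    """Hierarchy: the address is simply a child of the person."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("address")
            for p in root.findall("person")}


def addresses_referenced(xml_text):
    """Reference: build an id index first, then resolve each pointer."""
    root = ET.fromstring(xml_text)
    by_id = {a.get("id"): a.text for a in root.findall("address")}
    # A dangling reference yields None here; a real application would
    # have to decide how to handle that case.
    return {p.findtext("name"): by_id.get(p.get("address"))
            for p in root.findall("person")}


if __name__ == "__main__":
    print(addresses_nested(NESTED))
    print(addresses_referenced(REFERENCED))
```

The reference form pays off when several persons share one address, since the address is then stored only once.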

Granularity

Bad XML

Consistent Markup

Simple Markup

Ugly XML

Redundant Data

Redundancy in the Schema

Generically Generated Markup

<objects>
 <object type="person" id="o3444">
  <attribute name="name">Erik Wilde</attribute>
  <attribute name="email">dret@ischool.berkeley.edu</attribute>
  <relation type="office" object="o3445"/>
 </object>
 <object type="address" id="o3445">
  <attribute name="room">314 South Hall</attribute>
  <attribute name="phone" type="voice">+1-510-6432253</attribute>
  <attribute name="phone" type="fax">+1-510-6425814</attribute>
  <relation type="worker" object="o3444"/>
 </object>
</objects>
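
With generic markup like this, the real vocabulary (person, name, office) lives in attribute values, so the application has to rebuild the model itself before it can answer simple questions. The following Python sketch, using the standard xml.etree.ElementTree, shows one way to do that; load_objects and office_room are hypothetical helper names chosen for this example.

```python
import xml.etree.ElementTree as ET

DOC = """
<objects>
 <object type="person" id="o3444">
  <attribute name="name">Erik Wilde</attribute>
  <attribute name="email">dret@ischool.berkeley.edu</attribute>
  <relation type="office" object="o3445"/>
 </object>
 <object type="address" id="o3445">
  <attribute name="room">314 South Hall</attribute>
  <attribute name="phone" type="voice">+1-510-6432253</attribute>
  <attribute name="phone" type="fax">+1-510-6425814</attribute>
  <relation type="worker" object="o3444"/>
 </object>
</objects>
"""


def load_objects(xml_text):
    """Rebuild {id: object} from the object/attribute/relation markup."""
    objects = {}
    for obj in ET.fromstring(xml_text).findall("object"):
        fields = {}
        for attr in obj.findall("attribute"):
            # Names may repeat (two "phone" entries), so collect lists.
            fields.setdefault(attr.get("name"), []).append(attr.text)
        relations = {r.get("type"): r.get("object")
                     for r in obj.findall("relation")}
        objects[obj.get("id")] = {
            "type": obj.get("type"),
            "fields": fields,
            "relations": relations,
        }
    return objects


def office_room(objects, person_id):
    """Follow the person's office relation and return the room."""
    office_id = objects[person_id]["relations"]["office"]
    return objects[office_id]["fields"]["room"][0]


if __name__ == "__main__":
    print(office_room(load_objects(DOC), "o3444"))
```

With specific markup, the same question would be a direct path lookup, and a schema could check that a person actually has a name.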

XML Information Set (XML Infoset)

What is the Content of an XML Document?

Infoset Example

Figure: infoset-example.png

What is Not in the Infoset
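
Several surface details of the markup are not part of the Infoset, among them the order of attributes, the kind of quotation marks, whitespace inside tags, and the two forms of an empty element. A small Python sketch with the standard xml.etree.ElementTree shows that an application sees the same data for both spellings; the person element and its attribute values are made up for this example.

```python
import xml.etree.ElementTree as ET

# Same information, different surface syntax: attribute order, quote
# characters, whitespace inside the tag, and the empty-element form.
VARIANT_A = """<person id="p1" added='2006-08-30'/>"""
VARIANT_B = """<person   added="2006-08-30"
   id='p1'></person>"""


def summary(xml_text):
    """Return what the application sees: name, attributes, text, children."""
    elem = ET.fromstring(xml_text)
    return (elem.tag, dict(elem.attrib), elem.text or "", len(elem))


if __name__ == "__main__":
    print(summary(VARIANT_A) == summary(VARIANT_B))
```

Markup designs should therefore never encode meaning in these details, because components further down the processing chain cannot rely on seeing them.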

Conclusions

XML and Modeling

Assignment