Document Type Definition (DTD)

XML Foundations (INFO 242)

Erik Wilde, UC Berkeley School of Information
2007-09-11
This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Abstract

The XML specification defines a format for structured data (XML documents) and a grammar-based constraint language for these documents (DTD). In SGML-based systems, DTDs were often complex, feature-rich constructs that controlled much of the processing of SGML documents. XML greatly simplified DTDs, and de facto usage of DTDs today has simplified them even further. In many systems today, DTDs are not used at all or are generated from sample documents. This lecture argues that DTDs (or schemas, to be more general) should be taken seriously in any non-trivial XML application, because they represent the underlying (and often underspecified) data model of the application.

Outline (Schema Languages)

  1. Schema Languages
  2. DTD Basics
    1. DTD Syntax
    2. Defining Elements
    3. Defining Attribute Lists
  3. Advanced DTDs
    1. ID/IDREF
    2. Entities
  4. More Advanced DTDs
  5. Conclusions

XML Validation

Validation and Applications

[Figure: valid-documents.png]

Non-XML, Well-Formed, and Valid

<address>
 <name short="iSchool">School of Information</name>
 <voice>(510) 642-1464</phone>
 <fax>(510) 642-1464</fax>
 <website>http://ischool.berkeley.edu/</website>
 <postal>...</postal>
</address>

<address>
 <name short="iSchool">School of Information</name>
 <voice>(510) 642-1464</voice>
 <fax>(510) 642-1464</fax>
 <website>http://ischool.berkeley.edu/</website>
 <postal>...</postal>
</address>

<address>
 <name short="iSchool">School of Information</name>
 <phone type="voice">(510) 642-1464</phone>
 <phone type="fax">(510) 642-1464</phone>
 <website>http://ischool.berkeley.edu/</website>
 <postal>...</postal>
</address>
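
The difference between the first fragment and the other two can be checked mechanically. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` parser; the fragments are abbreviated to the lines that differ, and the helper name `is_well_formed` is made up for this illustration. The parser only tests well-formedness, so it cannot tell the second fragment from the third.

```python
import xml.etree.ElementTree as ET

# Abbreviated copies of the three fragments (hypothetical shortening).
NOT_WELL_FORMED = """<address>
 <name short="iSchool">School of Information</name>
 <voice>(510) 642-1464</phone>
</address>"""

WELL_FORMED_ONLY = """<address>
 <name short="iSchool">School of Information</name>
 <voice>(510) 642-1464</voice>
</address>"""

VALID_CANDIDATE = """<address>
 <name short="iSchool">School of Information</name>
 <phone type="voice">(510) 642-1464</phone>
</address>"""


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Raised here for the mismatched <voice> ... </phone> tags.
        return False


results = [is_well_formed(doc)
           for doc in (NOT_WELL_FORMED, WELL_FORMED_ONLY, VALID_CANDIDATE)]
print(results)  # [False, True, True]
```

Telling the second fragment from the third requires a DTD (or another schema), which is what the rest of the lecture covers.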

DTD Example

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE address SYSTEM "address.dtd">

<!ELEMENT address (name, phone*, website*, postal?)>
<!ELEMENT name (#PCDATA)>
<!ATTLIST name
 short CDATA #REQUIRED
>
<!ELEMENT phone (#PCDATA)>
<!ATTLIST phone
 type ( voice | fax ) #REQUIRED
>
<!ELEMENT postal (#PCDATA)>
<!ELEMENT website (#PCDATA)>
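
A content model such as `(name, phone*, website*, postal?)` is essentially a regular expression over the names of an element's children. The sketch below illustrates that idea only: it translates the `address` content model into a Python regular expression by hand and checks the order and number of child elements. It is not a DTD validator; it ignores attributes and text content, and the names `ADDRESS_MODEL`, `child_sequence`, and `matches_address_model` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of: <!ELEMENT address (name, phone*, website*, postal?)>
# Each child name is followed by one space so that names cannot run together.
ADDRESS_MODEL = re.compile(r"name (phone )*(website )*(postal )?")


def child_sequence(element: ET.Element) -> str:
    """Join the names of the direct children, each followed by one space."""
    return "".join(child.tag + " " for child in element)


def matches_address_model(text: str) -> bool:
    """Check only the order and count of the root element's children."""
    root = ET.fromstring(text)
    if root.tag != "address":
        return False
    return ADDRESS_MODEL.fullmatch(child_sequence(root)) is not None


valid = """<address>
 <name short="iSchool">School of Information</name>
 <phone type="voice">(510) 642-1464</phone>
 <phone type="fax">(510) 642-1464</phone>
 <website>http://ischool.berkeley.edu/</website>
 <postal>...</postal>
</address>"""

invalid = """<address>
 <name short="iSchool">School of Information</name>
 <voice>(510) 642-1464</voice>
 <fax>(510) 642-1464</fax>
</address>"""

print(matches_address_model(valid))    # True
print(matches_address_model(invalid))  # False
```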

XML Schema Languages

DTD Basics

XML is SGML light

Connecting Documents and DTDs

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE address SYSTEM "address.dtd">

DTD Syntax

DTDs are not XML Documents

<!ELEMENT address (name, phone*, website*, postal?)>
<!ELEMENT name (#PCDATA)>
<!ATTLIST name
 short CDATA #REQUIRED
>
<!ELEMENT phone (#PCDATA)>
<!ATTLIST phone
 type ( voice | fax ) #REQUIRED
>
<!ELEMENT postal (#PCDATA)>
<!ELEMENT website (#PCDATA)>

Syntax Rules

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE address SYSTEM "address.dtd">
<address>

Defining Elements

Element Only Content

<!ELEMENT table
     (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>
<!ELEMENT caption  %Inline;>
<!ELEMENT thead    (tr)+>
<!ELEMENT tfoot    (tr)+>
<!ELEMENT tbody    (tr)+>
<!ELEMENT colgroup (col)*>
<!ELEMENT col      EMPTY>
<!ELEMENT tr       (th|td)+>
<!ELEMENT th       %Flow;>
<!ELEMENT td       %Flow;>

Mixed Content

<!ELEMENT address (#PCDATA | %inline; | %misc.inline; | p)*>
<!ELEMENT style (#PCDATA)>

Empty Content

<!ELEMENT img EMPTY>
<!ATTLIST img
  %attrs;
  src         %URI;          #REQUIRED
  alt         %Text;         #REQUIRED
  name        NMTOKEN        #IMPLIED
  longdesc    %URI;          #IMPLIED
  height      %Length;       #IMPLIED
  width       %Length;       #IMPLIED
  usemap      %URI;          #IMPLIED
  ismap       (ismap)        #IMPLIED
  align       %ImgAlign;     #IMPLIED
  border      %Length;       #IMPLIED
  hspace      %Pixels;       #IMPLIED
  vspace      %Pixels;       #IMPLIED
  >

Defining Attribute Lists

Attributes belong to Elements

<!ELEMENT param EMPTY>
<!ATTLIST param
  id          ID             #IMPLIED
  name        CDATA          #REQUIRED
  value       CDATA          #IMPLIED
  valuetype   (data|ref|object) "data"
  type        %ContentType;  #IMPLIED
  >
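
What an attribute list declaration asks of a processor can be sketched in a few lines: required attributes must be present, enumerated attributes must use one of the listed values, and declared defaults are filled in when an attribute is absent. The table `PARAM_ATTLIST` below is a hand-written, simplified summary of the `param` declaration above (parameter entities and the `ID` type are not modelled), and `check_attributes` is a hypothetical helper, not part of any XML library.

```python
# Simplified, hand-written summary of <!ATTLIST param ...> (an assumption
# made for this sketch; a real processor reads this from the DTD).
PARAM_ATTLIST = {
    "id":        {"required": False, "default": None,   "allowed": None},
    "name":      {"required": True,  "default": None,   "allowed": None},
    "value":     {"required": False, "default": None,   "allowed": None},
    "valuetype": {"required": False, "default": "data",
                  "allowed": {"data", "ref", "object"}},
    "type":      {"required": False, "default": None,   "allowed": None},
}


def check_attributes(attrs, attlist):
    """Return (attributes with defaults applied, list of error messages)."""
    errors = []
    result = dict(attrs)
    for name in attrs:
        if name not in attlist:
            errors.append(f"undeclared attribute: {name}")
    for name, rule in attlist.items():
        if name not in result:
            if rule["required"]:
                errors.append(f"missing required attribute: {name}")
            elif rule["default"] is not None:
                result[name] = rule["default"]  # supply the declared default
        elif rule["allowed"] is not None and result[name] not in rule["allowed"]:
            errors.append(f"invalid value for {name}: {result[name]}")
    return result, errors


print(check_attributes({"name": "movie"}, PARAM_ATTLIST))
# ({'name': 'movie', 'valuetype': 'data'}, [])
print(check_attributes({"valuetype": "url"}, PARAM_ATTLIST)[1])
# ['missing required attribute: name', 'invalid value for valuetype: url']
```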

Attribute Types

  accept-charset %Charsets;  #IMPLIED
<!ENTITY % Charsets "CDATA">
    <!-- a space separated list of character encodings, as per [RFC2045] -->

Advanced DTDs

ID/IDREF

References in Documents

ID/IDREF in a Document

<document>
 <section id="sgml" author="dret">
  <title>Standard Generalized Markup Language (SGML)</title>
  <p>SGML is an ISO standard ...</p>
  <section id="sgml-syntax" author="bob">
   <title>SGML Syntax</title>
   <p>SGML uses markup, which is ...</p>
  </section>
 </section>
 <section id="xml" author="bob">
  <title>Extensible Markup Language (XML)</title>
  <p>XML is based on SGML (Section <ref name="sgml"/>) ...</p>
  <p type="example">XML can be used ...</p>
  <section id="xml-syntax" author="dret">
   <title>XML Syntax</title>
   <p>Section <ref name="sgml-syntax"/> describes ...</p>
  </section>
 </section>
</document>

<!ELEMENT section ( title, p+, section* ) >
<!ATTLIST section
 id ID #REQUIRED
 author CDATA #REQUIRED >
<!ELEMENT title ( #PCDATA )>
<!ELEMENT p ( #PCDATA | ref )*>
<!ATTLIST p
 type CDATA #IMPLIED >
<!ELEMENT ref EMPTY >
<!ATTLIST ref
 name IDREF #REQUIRED >
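
The two constraints that `ID` and `IDREF` add can be sketched without a validating parser: every `ID` value must be unique within the document, and every `IDREF` value must match some `ID`. The example below assumes Python's standard `xml.etree.ElementTree`, hard-codes the attribute names from the DTD above (`id` on `section`, `name` on `ref`) instead of reading them from the DTD, and uses a shortened document with one deliberately dangling reference (`xslt`). The function name `check_references` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened variant of the document above; the reference to "xslt" is
# deliberately dangling (an addition made for this sketch).
DOC = """<document>
 <section id="sgml" author="dret">
  <title>Standard Generalized Markup Language (SGML)</title>
  <p>SGML is an ISO standard ...</p>
 </section>
 <section id="xml" author="bob">
  <title>Extensible Markup Language (XML)</title>
  <p>XML is based on SGML (Section <ref name="sgml"/>) ...</p>
  <p>See also Section <ref name="xslt"/>.</p>
 </section>
</document>"""


def check_references(text):
    """Return (duplicate ID values, IDREF values without a matching ID)."""
    root = ET.fromstring(text)
    ids = [section.get("id") for section in root.iter("section")]
    duplicates = sorted({value for value in ids if ids.count(value) > 1})
    references = {ref.get("name") for ref in root.iter("ref")}
    dangling = sorted(references - set(ids))
    return duplicates, dangling


print(check_references(DOC))  # ([], ['xslt'])
```

A validating parser performs both checks itself, based on the attribute types declared in the DTD.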

References within the Tree

[Figure: section.png]

Formatting Example

XSLidy can generate links to sections such as the section about ID/IDREF, this link is then translated into the appropriate HTML code, meaning a link with the target being a fragment identifier to the slide number.

<p>XSLidy can generate links to sections such as the section about <link href="ididref"/>, this link is then translated into the appropriate HTML code, meaning a link with the target being a fragment identifier to the slide number.</p>

After running XSLidy, the following HTML is generated:

<p>XSLidy can generate links to sections such as the section about <a href="#(23)">ID/IDREF</a>, this link is then translated into the appropriate HTML code, meaning a link with the target being a fragment identifier to the slide number.</p>

ID/IDREF Semantics

Entities

General Entities

<!ENTITY aacute "&#225;"> <!-- latin small letter a with acute,
                                  U+00E1 ISOlat1 -->
<!ENTITY acirc  "&#226;"> <!-- latin small letter a with circumflex,
                                  U+00E2 ISOlat1 -->
<!ENTITY atilde "&#227;"> <!-- latin small letter a with tilde,
                                  U+00E3 ISOlat1 -->
<!ENTITY auml   "&#228;"> <!-- latin small letter a with diaeresis,
                                  U+00E4 ISOlat1 -->
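
General entities are replaced by their declared text when a document is parsed. The sketch below shows this with Python's standard `xml.etree.ElementTree`, assuming two of the declarations above are placed in an internal DTD subset (the XHTML DTD keeps them in an external file). The helper name `expand_entities` and both sample documents are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Two of the entity declarations, moved into an internal subset.
WITH_DECLARATIONS = """<!DOCTYPE p [
  <!ENTITY aacute "&#225;">
  <!ENTITY auml   "&#228;">
]>
<p>&aacute; and &auml;</p>"""

# The same content without any declarations.
WITHOUT_DECLARATIONS = "<p>&aacute; and &auml;</p>"


def expand_entities(text: str) -> str:
    """Parse the document and return the text content of its root element."""
    return ET.fromstring(text).text


# The parser replaces each entity reference with its declared text.
print(expand_entities(WITH_DECLARATIONS) == "\u00e1 and \u00e4")  # True

try:
    expand_entities(WITHOUT_DECLARATIONS)
except ET.ParseError as error:
    # Without a declaration the reference cannot be resolved.
    print("not well-formed:", error)
```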

Parameter Entities

XHTML Parameter Entities (Attributes)

<!ELEMENT p %Inline;>
<!ATTLIST p
  %attrs;
  %TextAlign;
  >
<!ENTITY % attrs "%coreattrs; %i18n; %events;">
<!ENTITY % coreattrs
 "id          ID             #IMPLIED
  class       CDATA          #IMPLIED
  style       %StyleSheet;   #IMPLIED
  title       %Text;         #IMPLIED"
  >
<!ENTITY % i18n
 "lang        %LanguageCode; #IMPLIED
  xml:lang    %LanguageCode; #IMPLIED
  dir         (ltr|rtl)      #IMPLIED"
  >
<!ENTITY % LanguageCode "NMTOKEN">
    <!-- a language code, as per [RFC3066] -->
<!ENTITY % TextAlign "align (left|center|right|justify) #IMPLIED">

XHTML Parameter Entities (Content)

<!ELEMENT p %Inline;>
<!ATTLIST p
  %attrs;
  %TextAlign;
  >
<!ENTITY % Inline "(#PCDATA | %inline; | %misc.inline;)*">
<!ENTITY % inline "a | %special; | %fontstyle; | %phrase; | %inline.forms;">
<!ENTITY % special
   "%special.basic; | %special.extra;">
<!ENTITY % special.basic
 "br | span | bdo">
<!ENTITY % special.extra
   "object | applet | img | map | iframe">
<!ENTITY % misc.inline "ins | del | script">
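
Parameter entities work like text macros inside the DTD: each `%name;` reference is replaced by the declared text, which may contain further references. The sketch below imitates that with plain string substitution on a subset of the declarations above. It is a simplification (a real XML parser applies additional rules, for example about where references may occur and about added whitespace), it has no cycle detection, and the names `PARAMETER_ENTITIES`, `REFERENCE`, and `expand` are hypothetical.

```python
import re

# A subset of the declarations above, reduced to name -> replacement text.
PARAMETER_ENTITIES = {
    "special.basic": "br | span | bdo",
    "special.extra": "object | applet | img | map | iframe",
    "special": "%special.basic; | %special.extra;",
    "misc.inline": "ins | del | script",
}

# Matches a parameter entity reference such as %special.basic;
REFERENCE = re.compile(r"%([A-Za-z_][\w.\-]*);")


def expand(text, entities):
    """Replace %name; references repeatedly until none are left."""
    while True:
        expanded = REFERENCE.sub(lambda match: entities[match.group(1)], text)
        if expanded == text:
            return expanded
        text = expanded


print(expand("(#PCDATA | %special; | %misc.inline;)*", PARAMETER_ENTITIES))
# (#PCDATA | br | span | bdo | object | applet | img | map | iframe | ins | del | script)*
```

Expanding the entities by hand in this way is a quick method for seeing the actual content model hidden behind several layers of indirection.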

More Advanced DTDs

Additional Mechanisms

Conclusions

DTD for XML Schemas

Modeling DTDs