Document Type Definition (DTD)

XML Foundations
Fall 2011 — INFO 242 (CCN 42596)

Ray Larson, UC Berkeley School of Information
2011-09-01

This work is licensed under a CC Attribution 3.0 Unported License [http://creativecommons.org/licenses/by/3.0/]

Contents R. Larson: Document Type Definition (DTD)

(2) Abstract

The XML specification defines a format for structured data (XML documents) and a grammar-based constraint language for such documents (DTD). In SGML-based systems, DTDs were often very complex and feature-rich constructs that controlled much of the processing of SGML documents. XML greatly simplified DTDs, and de facto usage of DTDs today has simplified them even more. In many systems today, DTDs are not used at all or are generated from sample documents. This lecture argues that DTDs (or, more generally, schemas) should be taken seriously in any non-trivial XML application, because they represent the application's underlying (and often underspecified) data model.



Schema Languages

Outline (Schema Languages)

  1. Schema Languages [5]
  2. DTD Basics [9]
    1. DTD Syntax [2]
    2. Defining Elements [3]
    3. Defining Attribute Lists [2]
  3. Advanced DTDs [9]
    1. ID/IDREF [5]
    2. Entities [4]
  4. Conclusions [2]

(4) XML Validation



(5) Validation and Applications

[Figure: valid-documents.png]

(6) Non-XML, Well-Formed, and Valid

Not XML (not well-formed: the voice start tag is closed by a phone end tag):
<address>
 <name short="iSchool">School of Information</name>
 <voice>(510) 642-1464</phone>
 <fax>(510) 642-1464</fax>
 <website>http://ischool.berkeley.edu/</website>
 <postal>...</postal>
</address>
Well-formed, but not valid against a DTD that declares phone elements instead of voice and fax:
<address>
 <name short="iSchool">School of Information</name>
 <voice>(510) 642-1464</voice>
 <fax>(510) 642-1464</fax>
 <website>http://ischool.berkeley.edu/</website>
 <postal>...</postal>
</address>
Well-formed and valid against such a DTD:
<address>
 <name short="iSchool">School of Information</name>
 <phone type="voice">(510) 642-1464</phone>
 <phone type="fax">(510) 642-1464</phone>
 <website>http://ischool.berkeley.edu/</website>
 <postal>...</postal>
</address>


(7) DTD Example

Document prolog referencing the DTD:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE address SYSTEM "address.dtd">

Declarations in address.dtd:
<!ELEMENT address (name, phone*, website*, postal?)>
<!ELEMENT name (#PCDATA)>
<!ATTLIST name
 short CDATA #REQUIRED
>
<!ELEMENT phone (#PCDATA)>
<!ATTLIST phone
 type ( voice | fax ) #REQUIRED
>
<!ELEMENT postal (#PCDATA)>
<!ELEMENT website (#PCDATA)>
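
To try the example without a second file, the same declarations can be moved into the document's internal subset. The following is a minimal sketch under that assumption; the address.dtd file itself is not used here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch: the declarations shown above, placed in the internal subset
     (an assumption made for self-containment; the slides use address.dtd). -->
<!DOCTYPE address [
  <!ELEMENT address (name, phone*, website*, postal?)>
  <!ELEMENT name (#PCDATA)>
  <!ATTLIST name short CDATA #REQUIRED>
  <!ELEMENT phone (#PCDATA)>
  <!ATTLIST phone type (voice | fax) #REQUIRED>
  <!ELEMENT postal (#PCDATA)>
  <!ELEMENT website (#PCDATA)>
]>
<address>
  <name short="iSchool">School of Information</name>
  <phone type="voice">(510) 642-1464</phone>
  <phone type="fax">(510) 642-1464</phone>
  <website>http://ischool.berkeley.edu/</website>
  <postal>...</postal>
</address>
```

A validating parser should accept this document, for example `xmllint --valid --noout` applied to a file with this content. Renaming a phone element to voice keeps the document well-formed but makes it invalid.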


(8) XML Schema Languages



DTD Basics

(10) XML is SGML light



(11) Associating Documents and DTDs

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE address SYSTEM "address.dtd">
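
For comparison, the document type declaration can take three common forms. The sketch below lists them as alternatives; a single document carries at most one of them. The note element in (c) is a hypothetical name used only for illustration.

```xml
<!-- (a) External DTD located by a system identifier (a URI, here relative) -->
<!DOCTYPE address SYSTEM "address.dtd">

<!-- (b) External DTD with a public identifier plus a system identifier -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<!-- (c) Internal subset: the declarations are written inside the document -->
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
]>
```

In all three forms, the name after DOCTYPE must match the document element.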


DTD Syntax

(13) DTDs are not XML Documents

  • DTDs use a special syntax
    • somewhat ironic when everything else is XMLized
    • DTDs cannot be processed with standard XML tools
    • more compact than XML syntax
  • Definition of elements and attribute lists
    • elements are defined by the content they allow
    • attribute lists are sets of allowed attributes on elements
<!ELEMENT address (name, phone*, website*, postal?)>
<!ELEMENT name (#PCDATA)>
<!ATTLIST name
 short CDATA #REQUIRED
>
<!ELEMENT phone (#PCDATA)>
<!ATTLIST phone
 type ( voice | fax ) #REQUIRED
>
<!ELEMENT postal (#PCDATA)>
<!ELEMENT website (#PCDATA)>


(14) Syntax Rules

  • There is no container containing the whole DTD
    • <!ELEMENT example EMPTY> thus is a complete DTD
  • Definitions (officially called declarations) use <!… > syntax
  • The document element is not marked explicitly
    • but it must be declared in the document type declaration
    • this means the document element is established by the document, not by the DTD
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE address SYSTEM "address.dtd">
<address>
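
That the document, not the DTD, picks the document element can be sketched with one DTD and two documents. The listing shows three separate files in one block; the file name items.dtd and the element names are hypothetical.

```xml
<!-- items.dtd (hypothetical file): two declarations, neither marked as root -->
<!ELEMENT list (item+)>
<!ELEMENT item (#PCDATA)>

<!-- Document 1: chooses list as its document element -->
<!DOCTYPE list SYSTEM "items.dtd">
<list><item>first</item><item>second</item></list>

<!-- Document 2: same DTD, but item is the document element -->
<!DOCTYPE item SYSTEM "items.dtd">
<item>standalone</item>
```

Both documents are valid against the same DTD, because each names a declared element in its document type declaration.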


Defining Elements

(16) Element Only Content

  • Element content is defined by a grammar for the children
    • sequences are indicated with a comma: ,
    • choices are indicated with a vertical bar: |
    • optional parts are indicated with a question mark: ?
    • repeatable parts (one or more occurrences) are indicated with a plus: +
    • optional and repeatable parts (zero or more occurrences) are indicated with an asterisk: *
    • parentheses can be used for grouping and nesting
<!ELEMENT table
     (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>
<!ELEMENT caption  %Inline;>
<!ELEMENT thead    (tr)+>
<!ELEMENT tfoot    (tr)+>
<!ELEMENT tbody    (tr)+>
<!ELEMENT colgroup (col)*>
<!ELEMENT col      EMPTY>
<!ELEMENT tr       (th|td)+>
<!ELEMENT th       %Flow;>
<!ELEMENT td       %Flow;>
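
The operators listed above can be seen together in a smaller, self-contained sketch. The recipe vocabulary is hypothetical and not part of XHTML.

```xml
<!DOCTYPE recipe [
  <!-- a title, then an optional note, then one or more ingredients,
       then either one or more steps or a single link,
       then any number of tags -->
  <!ELEMENT recipe (title, note?, ingredient+, (step+ | link), tag*)>
  <!ELEMENT title      (#PCDATA)>
  <!ELEMENT note       (#PCDATA)>
  <!ELEMENT ingredient (#PCDATA)>
  <!ELEMENT step       (#PCDATA)>
  <!ELEMENT link       EMPTY>
  <!ELEMENT tag        (#PCDATA)>
]>
<recipe>
  <title>Tea</title>
  <ingredient>water</ingredient>
  <ingredient>tea leaves</ingredient>
  <step>Boil the water.</step>
  <step>Steep the leaves.</step>
</recipe>
```

The instance omits the optional note and all tags, which the model permits. Placing a step before the ingredients, or using both step and link, would violate the sequence and the choice respectively.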


(17) Mixed Content

  • Mixed Content (see XML Basics: Mixed Content) allows text content and elements to be mixed
    • Whitespace characters (see XML Basics: Whitespace) are always allowed in Element Only Content (this does not need to be declared)
    • for non-whitespace characters, character data must be allowed explicitly
  • The allowed child elements may be constrained, but not their order or their number of occurrences
  • Mixed Content is always defined as <!ELEMENT x (#PCDATA | a | b | …)* >
<!ELEMENT address (#PCDATA | %inline; | %misc.inline; | p)*>
  • Character only content is a special case of mixed content
    • the element may only contain characters (no other elements)
    • the repetition is not necessary because there is no choice
<!ELEMENT style (#PCDATA)>
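
A minimal sketch of mixed content in use, with hypothetical inline elements em and code:

```xml
<!DOCTYPE p [
  <!-- #PCDATA must come first; the group must end with * -->
  <!ELEMENT p    (#PCDATA | em | code)*>
  <!ELEMENT em   (#PCDATA)>
  <!ELEMENT code (#PCDATA)>
]>
<p>Text, <code>code</code>, and <em>emphasis</em> in any <em>order</em> and number.</p>
```

The DTD can restrict which children appear inside p, but it cannot require, for example, that em occurs exactly once or that code precedes em.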


(18) Empty Content

  • Empty elements can be useful
    • they may contain all information in attributes
    • their presence may carry semantics without the need for additional information
<!ELEMENT img EMPTY>
<!ATTLIST img
  %attrs;
  src         %URI;          #REQUIRED
  alt         %Text;         #REQUIRED
  name        NMTOKEN        #IMPLIED
  longdesc    %URI;          #IMPLIED
  height      %Length;       #IMPLIED
  width       %Length;       #IMPLIED
  usemap      %URI;          #IMPLIED
  ismap       (ismap)        #IMPLIED
  align       %ImgAlign;     #IMPLIED
  border      %Length;       #IMPLIED
  hspace      %Pixels;       #IMPLIED
  vspace      %Pixels;       #IMPLIED
  >
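
Both uses of empty elements can be sketched in a few lines. The element names below are hypothetical and chosen only for illustration.

```xml
<!DOCTYPE line [
  <!ELEMENT line (#PCDATA | pagebreak | image)*>
  <!-- presence alone carries the meaning -->
  <!ELEMENT pagebreak EMPTY>
  <!-- all information lives in attributes -->
  <!ELEMENT image EMPTY>
  <!ATTLIST image
    src CDATA #REQUIRED
    alt CDATA #REQUIRED>
]>
<line>Before <pagebreak/> after, with <image src="logo.png" alt="Logo"/> and <pagebreak></pagebreak>.</line>
```

An element declared EMPTY may be written as an empty-element tag or as a start tag immediately followed by an end tag; even whitespace between the two tags would make the document invalid.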


Defining Attribute Lists

(20) Attributes belong to Elements

  • Attributes are specified in an element's Attribute List
    • an element definition may have any number of attributes associated with it
    • attributes may occur at most once on an element
  • Attribute definitions have a name, a type, and a default declaration
    • the default declaration determines whether the attribute must appear, may appear, or receives a default value
    • if the attribute is present, its value must conform to the type
<!ELEMENT param EMPTY>
<!ATTLIST param
  id          ID             #IMPLIED
  name        CDATA          #REQUIRED
  value       CDATA          #IMPLIED
  valuetype   (data|ref|object) "data"
  type        %ContentType;  #IMPLIED
  >


(21) Attribute Types

  • Attribute values can be constrained (which is not possible for the character content of elements)
    • CDATA means any character string (but no markup)
    • enumerated types list allowed values: (data|ref|object) (list of XML name tokens)
    • ID for identifying elements (part of ID/IDREF)
    • IDREF for referencing identified elements (part of ID/IDREF)
  • Application-oriented attribute types are often simulated
  accept-charset %Charsets;  #IMPLIED
<!ENTITY % Charsets "CDATA">
<!-- a space separated list of character encodings, as per [RFC2045] -->
  • The default declaration specifies the attribute's presence
    • #REQUIRED means the attribute has to be specified (on every occurrence of the element)
    • #IMPLIED marks an optional attribute (no default is supplied; the application may imply a value)
    • "…" specifies a default value (and the attribute is optional)
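
The three kinds of default declaration can be compared in a small sketch. The items vocabulary is hypothetical.

```xml
<!DOCTYPE items [
  <!ELEMENT items (item+)>
  <!ELEMENT item (#PCDATA)>
  <!ATTLIST item
    sku    CDATA        #REQUIRED
    note   CDATA        #IMPLIED
    status (new | used) "new">
]>
<items>
  <!-- status is omitted, so the declared default "new" applies -->
  <item sku="A-1">first item</item>
  <!-- all three attributes given explicitly -->
  <item sku="A-2" status="used" note="scratched">second item</item>
</items>
```

An item without sku, or with a status value outside the enumeration such as status="broken", would be reported as a validity error.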


Advanced DTDs

ID/IDREF

(24) References in Documents

  • Without Validation, there are no IDs
    • ID is an attribute type [Attribute Types (1)] declared in the DTD
    • xml:id (see XQuery, Part IV: XML IDs with xml:id) is an attempt to support schema-independent IDs
  • IDs are used to assign identities to elements
    • the XML processor reports duplicate IDs as errors (part of validation [http://www.w3.org/TR/xml/#id])
  • IDREFs are used to reference existing IDs
    • the XML processor reports references to non-existing IDs as errors (part of validation [http://www.w3.org/TR/xml/#idref])
  • IDs must be XML Names (in particular, they may not start with a digit)


(25) ID/IDREF in a Document

<document>
 <section id="sgml" author="dret">
  <title>Standard Generalized Markup Language (SGML)</title>
  <p>SGML is an ISO standard ...</p>
  <section id="sgml-syntax" author="bob">
   <title>SGML Syntax</title>
   <p>SGML uses markup, which is ...</p>
  </section>
 </section>
 <section id="xml" author="bob">
  <title>Extensible Markup Language (XML)</title>
  <p>XML is based on SGML (Section <ref name="sgml"/>) ...</p>
  <p type="example">XML can be used ...</p>
  <section id="xml-syntax" author="dret">
   <title>XML Syntax</title>
   <p>Section <ref name="sgml-syntax"/> describes ...</p>

DTD declarations for the document excerpt above:
<!ELEMENT section ( title, p+, section* ) >
<!ATTLIST section
 id ID #REQUIRED
 author CDATA #REQUIRED >
<!ELEMENT title ( #PCDATA )>
<!ELEMENT p ( #PCDATA | ref )*>
<!ATTLIST p
 type CDATA #IMPLIED >
<!ELEMENT ref EMPTY >
<!ATTLIST ref
 name IDREF #REQUIRED >
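
A complete, reduced version of this example is sketched below. The declaration of the document element is an assumption, since the excerpt above does not show one, and the author and type attributes are left out for brevity.

```xml
<!DOCTYPE document [
  <!-- assumed declaration: not shown in the excerpt above -->
  <!ELEMENT document (section+)>
  <!ELEMENT section (title, p+, section*)>
  <!ATTLIST section id ID #REQUIRED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT p (#PCDATA | ref)*>
  <!ELEMENT ref EMPTY>
  <!ATTLIST ref name IDREF #REQUIRED>
]>
<document>
  <section id="sgml">
    <title>SGML</title>
    <p>SGML is an ISO standard.</p>
  </section>
  <section id="xml">
    <title>XML</title>
    <p>XML is based on SGML (Section <ref name="sgml"/>).</p>
  </section>
</document>
```

A validating parser would report three kinds of errors if the instance were changed: a second section with id="sgml" (duplicate ID), a ref with name="html" (no such ID in the document), or an id such as "2nd" (not an XML Name).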


(26) References within the Tree

[Figure: section.png]

(27) Formatting Example

Hotspot can generate links to sections such as the section about ID/IDREF; this link is then translated into the appropriate HTML code, meaning a link with the target being a fragment identifier to the slide number.

The source markup for this paragraph:

<p>Hotspot can generate links to sections such as the section about <link href="ididref"/>; this link is then translated into the appropriate HTML code, meaning a link with the target being a fragment identifier to the slide number.</p>

After running Hotspot, the following HTML is generated:

<p>Hotspot can generate links to sections such as the section about <a href="#ididref">ID/IDREF</a>; this link is then translated into the appropriate HTML code, meaning a link with the target being a fragment identifier to the slide number.</p>


(28) ID/IDREF Semantics

  • Rooted in the document world
    • all parts are assembled before processing
    • names are symbolic and assigned as required
    • mixed syntax and semantics
  • Good idea, but many shortcomings
    • constraints apply to one document only
    • IDs and IDREFs are global instead of scoped
    • identifiers should be allowed to use any type
    • identifier processing should be type-specific (2 ≟ 02)
  • Applications must know how to process ID/IDREF
    • for HTML export, links can be generated
    • for databases, keys should be used


Entities

(30) General Entities

  • XML's core concept of physical data structures
    • an entity is a named unit of data which can be referenced
    • within documents, it is referenced by the markup &entity-name;
  • Entities can be used to name and reuse document content
<!ENTITY aacute "&#225;"> <!-- latin small letter a with acute,
                                  U+00E1 ISOlat1 -->
<!ENTITY acirc  "&#226;"> <!-- latin small letter a with circumflex,
                                  U+00E2 ISOlat1 -->
<!ENTITY atilde "&#227;"> <!-- latin small letter a with tilde,
                                  U+00E3 ISOlat1 -->
<!ENTITY auml   "&#228;"> <!-- latin small letter a with diaeresis,
                                  U+00E4 ISOlat1 -->
  • Character References look like entity references: &#9786; or &#x263A; = ☺
    • they can represent any Unicode character and are processed as single characters
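
A short sketch of both mechanisms in one document. The memo element and the school entity are hypothetical; aacute is declared as in the excerpt above.

```xml
<!DOCTYPE memo [
  <!ELEMENT memo (#PCDATA)>
  <!-- a general entity that names reusable text -->
  <!ENTITY school "School of Information">
  <!-- a general entity that names a single character -->
  <!ENTITY aacute "&#225;">
]>
<memo>The &school; (again: &school;) writes &aacute; and &#x263A;.</memo>
```

After parsing, the content of memo contains the replacement text twice, followed by the characters U+00E1 and U+263A. The character reference needs no declaration, while each entity must be declared before it is referenced.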


(31) Parameter Entities

  • Parameter entities are parsed entities for use within the DTD
    • a parameter entity must be specifically declared as such
    • within DTDs, it is referenced by the markup %entity-name;
    • outside of DTDs, parameter entities cannot be used
  • Like general entities, parameter entities are meant for reuse
    • in a DTD, reuse is mostly about reusing structures
    • parameter entities are the duct tape of DTDs: not elegant, but effective
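
Reuse of attribute declarations and of content model fragments can be sketched as follows. The file name shapes.dtd and all names in it are hypothetical. The declarations are assumed to live in an external DTD file, because parameter entity references inside a declaration are not allowed in the internal subset.

```xml
<!-- shapes.dtd (hypothetical external DTD file) -->

<!-- reusable attribute definitions; declared before first use -->
<!ENTITY % common.attrs
  "id    ID    #IMPLIED
   label CDATA #IMPLIED">

<!-- reusable content model fragment -->
<!ENTITY % shape "circle | square">

<!ELEMENT drawing (%shape;)*>
<!ELEMENT circle EMPTY>
<!ATTLIST circle %common.attrs; radius CDATA #REQUIRED>
<!ELEMENT square EMPTY>
<!ATTLIST square %common.attrs; side CDATA #REQUIRED>
```

Adding a third shape then means extending the shape entity and declaring the new element; the content model of drawing does not need to be touched.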


(32) XHTML Parameter Entities (Attributes)

<!ELEMENT p %Inline;>
<!ATTLIST p
  %attrs;
  %TextAlign;
  >
<!ENTITY % attrs "%coreattrs; %i18n; %events;">
<!ENTITY % coreattrs
 "id          ID             #IMPLIED
  class       CDATA          #IMPLIED
  style       %StyleSheet;   #IMPLIED
  title       %Text;         #IMPLIED"
  >
<!ENTITY % i18n
 "lang        %LanguageCode; #IMPLIED
  xml:lang    %LanguageCode; #IMPLIED
  dir         (ltr|rtl)      #IMPLIED"
  >
<!ENTITY % LanguageCode "NMTOKEN">
<!-- a language code, as per [RFC3066] -->
<!ENTITY % TextAlign "align (left|center|right|justify) #IMPLIED">


(33) XHTML Parameter Entities (Content)

<!ELEMENT p %Inline;>
<!ATTLIST p
  %attrs;
  %TextAlign;
  >
<!ENTITY % Inline "(#PCDATA | %inline; | %misc.inline;)*">
<!ENTITY % inline "a | %special; | %fontstyle; | %phrase; | %inline.forms;">
<!ENTITY % special
   "%special.basic; | %special.extra;">
<!ENTITY % special.basic
 "br | span | bdo">
<!ENTITY % special.extra
   "object | applet | img | map | iframe">
<!ENTITY % misc.inline "ins | del | script">


Conclusions

(35) DTD for XML Schemas



(36) Modeling DTDs


