Overview and Introduction

XML Foundations
Fall 2011 — INFO 242 (CCN 42596)

Ray Larson, UC Berkeley School of Information
2011-08-25

This work is licensed under a Creative Commons Attribution 3.0 Unported License [http://creativecommons.org/licenses/by/3.0/]

Contents R. Larson: Overview and Introduction

(2) Abstract

The Extensible Markup Language (XML) was introduced in 1998 to enable content providers to publish their content on the Web in an application-specific format. HTML was considered to convey too little semantics, since its only purpose was (and is) the preparation of content for Web-based publishing. XML was the first step towards machine-readable data formats for the Web, a trend that has since been taken further with the idea of the Semantic Web. XML appeared when the Web was in the steepest part of its success curve, and it has since become the globally accepted format for the exchange of machine-readable structured data.



(3) XML Executive Summary



(4) What's the Plan?



(5) What are we doing?

Altova XML Spy

Varia

Outline (Varia)

  1. Varia [4]
  2. What is XML? [6]
    1. What is XML Good for? [2]
    2. What is XML not Good for? [3]
  3. Why XML? [9]
    1. Pre-XML Problems [2]
    2. XML on the Web [3]
    3. XML Today [2]
  4. Beyond XML [2]

(7) About Me



(8) About this Course



(9) About these Slides



(10) Additional Resources



What is XML?


(12) XML Yin & Yang

yin-yang.png

What is XML Good for?


(14) Why Use XML?

  • Because you want to share data
    • share it in a format which is widely used and easy to use
    • enable others to use it on various platforms with existing tools
  • Because you want to share data cheaply
    • it is easier to use XML than to invent something new
    • it is even easier to use an existing XML schema than to invent a new one
  • Because you want to share data openly
    • if you invent new formats, people must build new tools to process them
    • avoid applying the security through obscurity principle inadvertently
    • application-specific processing should be deferred to higher layers
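
A minimal sketch of the first point, assuming a hypothetical <contacts> vocabulary invented purely for illustration: the consumer needs nothing beyond the XML parser that ships with Python's standard library to read the shared data.

```python
import xml.etree.ElementTree as ET

# Hypothetical shared data; element and attribute names are invented
# for this sketch and do not come from any real schema.
SHARED = """<contacts>
  <contact id="c1"><name>Ada</name><email>ada@example.org</email></contact>
  <contact id="c2"><name>Grace</name><email>grace@example.org</email></contact>
</contacts>"""

def read_contacts(xml_text):
    """Return (id, name, email) tuples using only a standard XML parser."""
    root = ET.fromstring(xml_text)
    return [
        (c.get("id"), c.findtext("name"), c.findtext("email"))
        for c in root.findall("contact")
    ]

contacts = read_contacts(SHARED)
for row in contacts:
    print(row)
```

The same document could be read with equivalent standard tools on other platforms, which is the point of sharing data as XML rather than in a newly invented format.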


(15) Is XML Self-Describing?

  • XML is often said to be self-describing
    • many people think this is the same as self-explanatory
    • the catch is what exactly "describing" refers to
  • Database data cannot live without a database
    • database data is simply content, the structure is provided by a DBMS
    • XML documents have their structure encoded within them
    • compared to database data, XML in fact is self-describing
  • What is the gap between self-describing and self-explanatory?
    • it is impossible to find out how the document could be modified
    • there are no semantics associated with either structure or content
    • so self-describing means you can guess a lot, but you may be wrong
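
The gap can be sketched with a deliberately cryptic document (the names r, a, b, and k are hypothetical and chosen to carry no meaning). A generic parser recovers the complete structure without any external schema, yet nothing in the result says what the values mean or which changes would be allowed.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with intentionally meaningless names.
DOC = "<r><a k='7'>12.5</a><a k='9'>3</a><b>x</b></r>"

def outline(elem, depth=0):
    """List element names and attribute names, indented by nesting depth."""
    lines = ["  " * depth + elem.tag + "".join(f" @{k}" for k in elem.attrib)]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines

structure = outline(ET.fromstring(DOC))
print("\n".join(structure))
# The structure is fully recovered, but whether 12.5 is a price, a length,
# or a score is not stated anywhere in the document.
```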


What is XML not Good for?


(17) XML is Character-Based

  • XML is not a binary format, it is based on Unicode [XML Basics; Unicode (1)]
    • binary structures cannot (or rather should not) be described using XML
  • Multimedia formats often are binary
    • image formats such as GIF, JPEG, and PNG
    • audio formats such as MP3 and AAC
    • video formats such as MPEG4 and H.264
  • But: multimedia also uses many XML formats
    • vector graphics formats such as Scalable Vector Graphics (SVG)
    • Synchronized Multimedia Integration Language (SMIL) for describing presentations
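
One way to see why binary data "should not" rather than "cannot" be put into XML is the following sketch, which assumes a hypothetical <blob> element and uses Base64 to turn bytes into characters. It works, but the text form is about one third larger than the bytes it carries, and every consumer has to decode it again.

```python
import base64
import xml.etree.ElementTree as ET

# 300 arbitrary binary bytes standing in for image or audio data.
payload = bytes(range(256)) + bytes(44)

# The <blob> element and its 'encoding' attribute are hypothetical names.
blob = ET.Element("blob", {"encoding": "base64"})
blob.text = base64.b64encode(payload).decode("ascii")
xml_text = ET.tostring(blob, encoding="unicode")

# Round trip: parse the XML and decode the characters back into bytes.
decoded = base64.b64decode(ET.fromstring(xml_text).text)

print(len(payload), "bytes became", len(blob.text), "characters")
```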


(18) XML is a Syntax for Trees

  • Not all data is easily represented by trees
    • overlapping markup (multiple views of the same content)
    • graph-like structures which are less constrained than trees
  • What is it that you have in your tree?
    • XML encodes a structure purely on the syntactic level
    • what the structures mean is in no way described by XML
    • XML structures must be accompanied by semantic descriptions
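
Both limitations can be sketched briefly. Overlapping markup is rejected by a parser as not well-formed, and a cyclic graph has to be flattened into elements that point at each other. The id and to attribute names below are assumptions made for this example; XML itself does not know that to refers to another node, so that meaning lives entirely in the application code.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """True if the text parses as XML, False on a well-formedness error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

nested = "<p><b>bold <i>bold italic</i></b> plain</p>"
overlapping = "<p><b>bold <i>both</b> italic</i></p>"  # <b> closes inside <i>

# A cycle a -> b -> c -> a cannot be expressed by nesting, so the edges
# are stored as references between sibling elements (hypothetical names).
GRAPH = """<graph>
  <node id="a"><edge to="b"/></node>
  <node id="b"><edge to="c"/></node>
  <node id="c"><edge to="a"/></node>
</graph>"""

root = ET.fromstring(GRAPH)
edges = {
    node.get("id"): [edge.get("to") for edge in node.findall("edge")]
    for node in root.findall("node")
}

print(is_well_formed(nested), is_well_formed(overlapping))
print(edges)
```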


(19) XML Usages

  • XML can be used [@bestpractices]
    • people should be able to use your XML directly using standard tools
    • if they absolutely need a set of special tools, something is wrong
  • XML is hip, so everybody wants to use it
    • many things have been created ad hoc and without much planning
    • if you start something which is XML-based, use XML responsibly
    • if you have to use some bad XML, complain about it
  • Finding the balance can be hard
    • XML is great for prototyping and experiments
    • once you decide to redesign your XML, it may be too late
    • XML documents may be short-lived, XML schemas are definitely not


Why XML?


(21) Web Technologies



(22) From Humans to Machines



Pre-XML Problems


(24) HTML is for Humans

  • HTML is a format for dead ends
    • HTML is good for rendering Web pages
    • HTML is bad for understanding Web pages
    • the browser is a dead end (from a machine's point of view)
  • Web growth in the late 1990s was enormous
    • everybody was putting information online
    • but this information was inaccessible to machines
  • How can this information be made accessible to machines?
    • HTML is not the right format (slightly better than fax machines)
    • there was no other widely accepted format for structured data
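
The difference between rendering and understanding can be sketched with two invented fragments that carry the same facts. The HTML-style view is written as well-formed markup only so that one parser can read both; the <book> vocabulary is hypothetical.

```python
import xml.etree.ElementTree as ET

# Presentation-oriented markup: the tags describe layout, not meaning.
HTML_VIEW = ("<table><tr><td><b>Dune</b></td><td>Herbert</td>"
             "<td>9.99</td></tr></table>")

# Application-specific markup (hypothetical vocabulary).
XML_DATA = ("<book><title>Dune</title><author>Herbert</author>"
            "<price currency='USD'>9.99</price></book>")

# From the rendering view, a program can only guess by position.
cells = ["".join(td.itertext()) for td in ET.fromstring(HTML_VIEW).iter("td")]
guessed_price = float(cells[2])  # breaks as soon as the column order changes

# From the application-specific view, it can ask for a field by name.
book = ET.fromstring(XML_DATA)
price = float(book.findtext("price"))
currency = book.find("price").get("currency")

print(guessed_price, price, currency)
```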


(25) A Machine-Friendly Web

  • Information should be published in a machine-understandable format
    • HTML is good for rendering Web pages
    • HTML is bad for understanding Web pages
    • understanding is the key term here: application semantics!
  • Information should be published in application-specific formats
    • HTML is one application: rendering documents for humans
    • machines need other structures to process Web content
  • 1996: W3C Working Group SGML on the Web
    • HTML is just one document type defined with SGML
    • SGML is a very complex and expensive technology
    • how can SGML be made easily and widely usable?


XML on the Web


(27) SGML, HTML, and XML

  • Standard Generalized Markup Language (SGML)
    • a language for designing document types
    • a very complex standard with many expensive and non-interoperable implementations
  • Hypertext Markup Language (HTML)
    • implements a simple SGML document type [http://www.w3.org/TR/REC-html40/sgml/loosedtd.html]
    • its syntax is SGML syntax [http://www.oasis-open.org/cover/sgmlsyn/sgmlsyn.htm], it is not defined by HTML itself
    • uses very few SGML features, dedicated processors are rather easy to build
  • Extensible Markup Language (XML)
    • a language for designing document types (i.e., classes of documents)
    • a greatly simplified version of SGML, omitting many obscure features
    • a specification with no optional parts!


(28) XML Documents on the Web

  • XML's idea was that content should be published as XML
    • stylesheets could then be used to render human-readable views
    • machines could simply use the underlying XML
  • There are (almost) no XML documents on the Web
    • stylesheet support depends on browsers (software has a long life!)
    • many content providers do not want to publish machine-readable data
  • There are many XML documents behind HTML documents
    • content does not have to be made public in a machine-readable way
    • browser-independent HTML can be produced from XML
    • XML technologies can be leveraged on the server-side
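
The server-side pattern from the last group of bullets can be sketched as follows. In practice the transformation would typically be an XSLT stylesheet; because Python's standard library has no XSLT processor, this sketch substitutes a small hand-written function, and the <catalog> vocabulary is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML kept on the server; it is never sent to the browser.
SOURCE = """<catalog>
  <book><title>Dune</title><price>9.99</price></book>
  <book><title>Emma</title><price>4.50</price></book>
</catalog>"""

def render_html(xml_text):
    """Produce a plain HTML list from the XML source (stand-in for XSLT)."""
    catalog = ET.fromstring(xml_text)
    ul = ET.Element("ul")
    for book in catalog.findall("book"):
        li = ET.SubElement(ul, "li")
        li.text = f"{book.findtext('title')} ({book.findtext('price')})"
    return ET.tostring(ul, encoding="unicode", method="html")

html = render_html(SOURCE)
print(html)
```

Only the generated HTML reaches the client, so the result does not depend on stylesheet support in the browser, and the structured source stays private.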


(29) XML Documents Elsewhere

  • XML is not used as intended, but it is very successful
    • as a server-side foundation for Web publishing
    • as a B2B-focused format with no Web publishing in mind
  • XML has been successful for several reasons
    • being there at the right time (Internet bubble)
    • politically correct (the W3C is OS-agnostic)
    • technically sound (simple and no optional parts)
    • human-readable based on a well-known syntax
    • great for rapid prototyping and experiments


XML Today

(31) Used Everywhere

  • Very small: Messages from sensors
    • e.g., building automation or car electronics
    • mostly implemented in hardware or firmware
  • Very large: Genome sequences
    • encoding the results of genome analyses
    • yields very large XML documents (several gigabytes)
  • Very different processing requirements
    • very fast processing (time critical applications)
    • memory-conserving processing (very large documents)
    • incremental processing (streaming)
    • random access (only small parts required)
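
Incremental, memory-conserving processing can be sketched with the standard library's event-based parser. The <reads> and <read> names and the length attribute are hypothetical stand-ins for a very large result file; a small in-memory stream is used here so the example runs on its own.

```python
import io
import xml.etree.ElementTree as ET

def count_long_reads(stream, min_length):
    """Count <read> elements whose length is at least min_length,
    handling one element at a time instead of building the whole tree."""
    count = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "read":
            if int(elem.get("length")) >= min_length:
                count += 1
            # Drop the element's content once handled. For truly huge
            # inputs the emptied elements still attached to the root
            # would need to be removed as well.
            elem.clear()
    return count

# Tiny stand-in for a multi-gigabyte document.
sample = io.BytesIO(
    b"<reads>"
    + b"".join(b'<read length="%d"/>' % n for n in (50, 150, 250))
    + b"</reads>"
)
result = count_long_reads(sample, 100)
print(result)
```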


(32) This Course and XML

  • XML is ASCII for the 21st century
    • information professionals should know and use XML
    • you will see it in many projects
    • you will hopefully use it in many projects
    • you will be able to build and test prototypes very rapidly
  • What do you need for using XML?
    • XML and some kind of schema language
    • XSLT for processing it


Beyond XML


(34) Sharing Concepts



(35) The Semantic Web


