Abstract

The Hypertext Markup Language (HTML) is the language for providing Web content. It is based on the idea of structuring content, while the layout should be controlled using stylesheet languages. HTML documents have a document head, which carries document metadata. One of the most important tasks of HTML is to serve as an interface for data input for Web-based applications. This is done with HTML Forms.

Creating HTML

HTML is a standardized language
- CSS and scripting turn HTML into a complex environment
- exhaustive testing is very hard (browser × version × OS × language × screen size × ...)
Web design is only one part of Web publishing
Design questions should be addressed as one part of Web publishing
- HTML is a format for structured content, not a page description language
- Cascading Stylesheets (CSS) provide a separation between content and formatting

HTML Alternatives

Macromedia Flash
- Flash is not a Web technology
- Flash applications are stand-alone applications executed by the browser
- they break most principles which are important for Web publishing
- they are useful for stand-alone applications (information kiosks)
- they are acceptable as long as they do not have state
Portable Document Format (PDF)
- PDF is not a Web technology
- PDF is optimized for representing paginated documents
- PDF documents break most principles which are important for Web publishing
- set up a publishing pipeline which provides PDF as a resource variant

Outline (Unicode)

Unicode
HTML Document Head
Language Identification
HTML Forms
HTML and XHTML

Characters and Computers

American Standard Code for Information Interchange (ASCII)
- for the first time a basic set of characters had a universally accepted encoding
- many Internet Protocols encode their information in ASCII commands
ASCII is a very limited repertoire of characters
- basic ASCII contains 128 characters (7 bit) with a number of control chars
- no variants of characters (German umlauts, French accents) are supported
- various code pages extending ASCII to 8 bit exist and are hard to distinguish
Character is not a trivial concept when regarded globally
- European languages all have writing systems based on a small number of atoms
- other languages and writing systems have vastly different ideas of language atoms

Characters

Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape [...]

The Unicode Standard, Version 4.0, Addison-Wesley, 2003

The alphabetic approach is only one of several possibilities
- A character in Japanese hiragana and katakana scripts corresponds to a syllable (usually a combination of consonant plus vowel)
- Korean Hangul combines symbols for individual sounds of the language into square blocks, each of which represents a syllable; depending on the user and the application, either the individual symbols or the syllabic clusters can be considered to be characters
- In Indic scripts each consonant letter carries an inherent vowel that is eliminated or replaced using semi-regular or irregular ways to combine consonants and vowels into clusters; depending on the user and the application, either individual consonants or vowels, or the consonant or consonant-vowel clusters can be perceived as characters
- Arabic and Hebrew vowel sounds are typically not written at all; when they are written they are indicated by the use of combining marks placed above and below the consonantal letters

Glyphs

[A Glyph is] a recognizable abstract graphic symbol which is independent of a specific design.

ISO/IEC 9541:1991, Information Technology – Font Information Interchange

Visual rendering introduces the notion of a glyph.
There is not a one-to-one correspondence between characters and glyphs
- A single character can be represented by multiple glyphs (each glyph is then part of the representation of that character); these glyphs may be physically separated from one another
- A single glyph may represent a sequence of characters (this is the case with ligatures, among others)
- A character may be rendered with very different glyphs depending on the context
- A single glyph may represent different characters (e.g. capital Latin A, capital Greek A and capital Cyrillic A)

Unicode

Generalization takes characters beyond one and even two bytes
- Unicode recently added its 100'000th character
- a more structured approach is required
Unicode cleanly separates various conceptual steps
- characters are collected and are then part of the character repertoire
- characters are then identified by a unique code point (U+0041)
- a Character Encoding Scheme (CES) then maps the Coded Character Set (CCS) based on a Character Encoding Form (CEF)
Unicode is the ASCII of the 21st century
- purists sometimes consider Unicode too big or dangerous
- Unicode is well-established and is necessary in a globalized economy
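
The step from character to code point can be tried out directly. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is an assumption made for illustration and does not come from the slides.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point in U+XXXX notation together with the character name."""
    code_point = ord(ch)                      # number assigned by the coded character set
    name = unicodedata.name(ch, "<unnamed>")  # name from the Unicode character database
    return f"U+{code_point:04X} {name}"

print(describe("A"))       # U+0041 LATIN CAPITAL LETTER A
print(describe("\u05D0"))  # U+05D0 HEBREW LETTER ALEF
```

Note that nothing in this step says how the code point is stored as bytes; that is the job of the encoding form and encoding scheme.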

Unicode Character Count

Unicode has the ability to encode 1'114'112 characters
- this means that currently ~10% of the available space is used
Characters are organized into 17 planes of 2¹⁶ = 65'536 characters
Planes are numbered from 0 to 16
Plane 0 is the Basic Multilingual Plane (BMP)
- almost all characters in common use today are part of the BMP
Planes beyond the BMP contain rare and historic characters
- Old Italic, Deseret, Byzantine Musical Symbols
- most space within these astral planes is empty
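
The plane arithmetic above can be checked with a few lines. This is a sketch only; the constant and function names are assumptions chosen for illustration.

```python
PLANE_SIZE = 2 ** 16   # 65'536 code points per plane
PLANE_COUNT = 17       # planes are numbered 0 to 16

def plane_of(ch: str) -> int:
    """Return the number of the plane that contains the character's code point."""
    return ord(ch) >> 16   # dropping the low 16 bits leaves the plane number

# 17 planes of 65'536 code points give the total code space.
assert PLANE_SIZE * PLANE_COUNT == 1_114_112

print(plane_of("\u597D"))      # 0, the Basic Multilingual Plane
print(plane_of("\U000233B4"))  # 2, an astral plane
```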

Unicode Encodings

Character	A	א	好	𣎴
Code point	U+0041	U+05D0	U+597D	U+233B4
UTF-8	41	D7 90	E5 A5 BD	F0 A3 8E B4
UTF-16	00 41	05 D0	59 7D	D8 4C DF B4
UTF-32	00 00 00 41	00 00 05 D0	00 00 59 7D	00 02 33 B4
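
The byte sequences in this table can be reproduced with Python's built-in codecs. The sketch below assumes the big-endian variants without a byte order mark, which is what the table shows; the helper names are illustrative.

```python
def hex_bytes(data: bytes) -> str:
    """Format a byte string as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

def encodings(ch: str) -> dict:
    """Encode one character in the three Unicode encoding forms (big-endian, no BOM)."""
    return {
        "UTF-8": hex_bytes(ch.encode("utf-8")),
        "UTF-16": hex_bytes(ch.encode("utf-16-be")),
        "UTF-32": hex_bytes(ch.encode("utf-32-be")),
    }

for ch in ["A", "\u05D0", "\u597D", "\U000233B4"]:
    print(f"U+{ord(ch):04X}", encodings(ch))
```

The last character lies outside the BMP, which is why UTF-8 needs four bytes and UTF-16 needs a surrogate pair for it.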

UTF-8

UTF-8 is one of the two standardized encodings for XML
- every ASCII document by definition is a UTF-8 document
- UTF-8 must be supported by every XML implementation
UTF-8 is not trivial, but it is widely supported and easy to implement
- there is no 1:1 correspondence between bytes and characters
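- each Unicode character is encoded by 1-4 bytes (the original design allowed up to 6, but RFC 3629 restricts UTF-8 to the Unicode code space)
- UTF-8 is compact for text that is mostly ASCII, such as many European languages (1 byte per ASCII character)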
Non-Unicode documents must be transcoded to UTF-8
- keeping track of resource character encodings is a good idea
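
Transcoding only works when the source encoding is known, which is why keeping track of it matters. A minimal sketch, assuming the input is known to be ISO-8859-1; the function name is an illustrative assumption.

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes using the known source encoding and re-encode them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")

latin1 = "Lübeck".encode("iso-8859-1")   # 6 bytes, one per character
utf8 = transcode_to_utf8(latin1, "iso-8859-1")

print(len(latin1), len(utf8))            # 6 7: the umlaut takes two bytes in UTF-8
print(b"HTML".decode("utf-8"))           # pure ASCII bytes are already valid UTF-8
```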

Other UTFs

UTF-16 stores every character as 2 or more bytes
- BMP characters are stored as 2 bytes
- astral plane characters are stored as 4 bytes
- UTF-16 is the other encoding (in addition to UTF-8) required by XML
UTF-32 stores every character as 4 bytes
- very simple and very inefficient (ASCII text volume increases by 300%)
Multi-byte formats introduce the problem of byte order
- UTF-16BE/UTF-32BE and UTF-16LE/UTF-32LE are stored with a guaranteed byte order
- plain UTF-16/UTF-32 may start with a Byte Order Mark (BOM, U+FEFF) from which the byte order can be detected
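
BOM detection can be sketched as follows for UTF-16. The function name is an assumption, and UTF-32 is deliberately left out because its little-endian BOM begins with the same two bytes as the UTF-16 one and would need to be tested first.

```python
import codecs

def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of UTF-16 data from a leading byte order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little-endian"
    return "unknown (no BOM)"

with_bom = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
print(detect_utf16_byte_order(with_bom))                 # little-endian
print(detect_utf16_byte_order("A".encode("utf-16-be")))  # unknown (no BOM)
```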

Character Set Identification

HTTP, XML, and HTML support character set identification
- HTTP supports the Content-Type header field
```
Content-Type: text/html; charset=utf-8
```
- XML encodes the character set in the XML declaration
```
<?xml version="1.0" encoding="utf-8"?>
```
- HTML supports the meta element in the document's head
```
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
```
Different identifications serve different purposes
Conflicting identifications are a sign of management problems
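
A client has to extract the character set from these identifications before it can decode the document. The sketch below reads the `charset` parameter of a Content-Type value with the standard library's `email.message.Message`; the wrapper function and its `default` argument are assumptions made for illustration.

```python
from email.message import Message

def charset_from_content_type(header_value: str, default: str = "unknown") -> str:
    """Return the charset parameter of a Content-Type header value, lower-cased."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default

print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # unknown
```

The same function works for the value of an HTTP header field and for the `content` attribute of the corresponding `meta` element, which makes it easy to compare the two and spot conflicts.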

Outline (HTML Document Head)


HTML Document Structure

HTML Metadata

HTML documents contain content and metadata
- the only mandatory metadata is the Document Title
- more metadata improves how documents can be handled
Document content is what is rendered in the browser window
- the visual part of a Web page
- there is very little semantics associated with document contents
Document metadata is what helps clients to manage documents
- additional information (which may be ignored if not supported)
- metadata usually have well-defined semantics

Document Title

The only mandatory part of HTML document metadata
Used for various aspects in browsers
- the window title for document windows
- the name for bookmarks and other document names
Used in other places where documents should have names
- result lists of search engines
- generated navigation links (site maps)
Document titles should be short and context-free
- short because they often will be used in list-like environments
- context-free because they often will be used outside of the page's context

Metadata

Document metadata is specified with the meta element
- name identifies a metadata property (there is no standardized list)
- content specifies the property value
```
<meta name="Author" content="Erik Wilde">
<meta name="copyright" content="© 2007 dret.net">
```
Client-side support for metadata varies and is hard to predict
- properties that are spamming targets (e.g., keywords) can be safely omitted
- if information can be generated automatically it should be included

meta can also simulate the presence of HTTP headers

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="Refresh" content="10; URL=http://whatever.com/">
<meta http-equiv="PICS-Label" content="(PICS-1.1 ... )">
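
How a client might collect this metadata can be sketched with Python's standard `html.parser`. The class name `HeadMetadata` and the sample title are assumptions for illustration; real clients differ widely in which properties they actually use.

```python
from html.parser import HTMLParser

class HeadMetadata(HTMLParser):
    """Collect the title, named meta properties, and http-equiv values of a document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}         # property name -> content
        self.http_equiv = {}   # simulated header name -> content
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "content" in attrs:
            if "name" in attrs:
                self.meta[attrs["name"].lower()] = attrs["content"]
            elif "http-equiv" in attrs:
                self.http_equiv[attrs["http-equiv"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Hypothetical sample document built from the meta elements shown above.
sample = """<html><head><title>HTML Forms</title>
<meta name="Author" content="Erik Wilde">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head><body></body></html>"""

parser = HeadMetadata()
parser.feed(sample)
print(parser.title, parser.meta, parser.http_equiv)
```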

Links

Metadata links specify connections to other resources
- they simply are a special form of metadata
- rel identifies the relation to the linked resource
- href specifies the URI of the linked resource
- exotic attributes are hreflang/charset (resource language and character set) and rev (reverse rel)
Links can be very useful but may easily break
- they can be easily kept up-to-date in a controlled environment
- if links can be generated automatically they should be included
- apart from links to External Stylesheets, metadata links are used rarely

Document Relationships

Web sites often have site map information
- site maps and other document relationships are useful navigation aids
- most browsers force Web designers to implement their own navigation
Document relationships can be specified as document metadata
- navigation information is accessible as machine-readable information
- alternate, stylesheet, start, next, prev, contents, index, glossary, copyright, chapter, section, subsection, appendix, help
Usability and accessibility can be improved with document relationships
- most browsers ignore this information
- if relationships can be generated automatically they should be included
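
Generating relationship links automatically is straightforward when the publishing system knows the page order. A minimal sketch, assuming an ordered list of page URIs; the function name and the file names are hypothetical.

```python
from html import escape

def relationship_links(pages, current):
    """Return link elements for the start, previous, and next page of `current`."""
    i = pages.index(current)
    links = [("start", pages[0])]
    if i > 0:
        links.append(("prev", pages[i - 1]))
    if i < len(pages) - 1:
        links.append(("next", pages[i + 1]))
    return [f'<link rel="{rel}" href="{escape(href, quote=True)}">' for rel, href in links]

pages = ["index.html", "unicode.html", "forms.html"]   # hypothetical site structure
for link in relationship_links(pages, "unicode.html"):
    print(link)
```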

External Stylesheets

Stylesheets are specific for a medium and use some stylesheet language
- media specifies the target media (e.g., screen or print)
- href specifies the URI of the linked resource
```
<link rel="stylesheet" type="text/css" media="screen, projection, print" href="xslidy/xslidy.css">
```
Cascading Stylesheets (CSS) should be managed as separate files
- they can be reused (and thus be cached) for various HTML documents
- clients can ignore stylesheets based on media types

Outline (Language Identification)


Language Specification

Languages can be specified at different places in the document

XHTML and Metadata provide various methods

<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<meta http-equiv="Content-Language" content="en" />

Languages can also be specified as an HTTP header field
```
Content-Language: en
```
HTTP-oriented approaches allow a list of languages to be specified
Always specify the language in an attribute of the html element
- for HTML use lang only
- for XHTML served as text/html use lang and xml:lang
- for XHTML served as application/xhtml+xml use xml:lang only
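
The three rules above can be written down as a small decision function for a publishing pipeline. This is a sketch; the function name and its parameters are assumptions, not part of any standard API.

```python
def language_attributes(lang: str, is_xhtml: bool, media_type: str = "text/html") -> str:
    """Return the language attributes for the html element, following the rules above."""
    if not is_xhtml:
        return f'lang="{lang}"'                       # HTML: lang only
    if media_type == "text/html":
        return f'lang="{lang}" xml:lang="{lang}"'     # XHTML as text/html: both
    if media_type == "application/xhtml+xml":
        return f'xml:lang="{lang}"'                   # XHTML as XML: xml:lang only
    raise ValueError(f"unexpected media type: {media_type}")

print(language_attributes("en", is_xhtml=True))
```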

HTTP Language Specification Issues

HTTP allows the specification of the Content-Language
HTTP headers are transient
- they are not saved when the document is saved
- they might not be cached when the document is cached
Setting HTTP headers requires server configuration
- the server must be able to detect the language of a resource
- the server must be configured to set the appropriate HTTP header
Web sites should provide HTTP and HTML language identification
- easy to set up correctly in a controlled publishing environment

Outline (HTML Forms)


HTTP Web Services

Services can be provided through URI/HTTP
- URI-based services need input as a query string
- the question is how the user gets information into the URI
HTML forms provide an interface for assembling query strings
- users fill out a form providing several fields
- the browser submits the entered information by HTTP to a URI
- the result of the request is displayed to the user
HTTP has two different methods for submitting data
- GET encodes the data as a URI query string
- POST encodes the data as an HTTP request entity

Forms Mechanics

HTML forms are normal Web pages (using form elements)
The process receiving the form data produces a result page

Forms Markup

All form elements must be inside a form element
- specifies the URI for submitting the form values (action="URI")
- specifies the method for submitting the form values (method="GET|POST")
- for POST forms, the encoding may be selected
form contains regular HTML markup and form elements
- the regular HTML markup creates the form's layout (table, list, texts)
- the form elements create the controls for acquiring input data
Each form should have a submit button
- when pressing this button, the form values are sent to the action URI
- without such a button, the form values cannot be submitted

Forms Elements (User View)

HTML provides a small set of form controls
Sufficient for many applications

Forms Elements (Source View)

<form action="http://stevex.net/dump.php" method="POST" enctype="multipart/form-data"><table>

<tr><td>Text:</td><td><input type="text" name="text" value="text input"/></td></tr>

<tr><td>Password:</td><td><input type="password" name="password" value="hidden text"/></td></tr>

<tr><td>Checkbox:</td><td><input type="checkbox" name="check" value="1"/> <input type="checkbox" name="check" value="2"/> <input type="checkbox" name="check" value="3"/></td></tr>

<tr><td>Radio Button:</td><td><input type="radio" name="radio" value="1"/> <input type="radio" name="radio" value="2"/> <input type="radio" name="radio" value="3"/></td></tr>

<tr><td>Submit:</td><td><input name="submit" type="submit"/></td></tr>

<tr><td>File Upload:</td><td><input name="file" type="file"/></td></tr>

<tr><td>Text Areas:</td><td><textarea name="textarea" rows="2" cols="20"></textarea></td></tr>

<tr><td>Selection:</td><td><select name="select"><option selected="selected">XML</option><option>SGML</option></select></td></tr>

<tr><td>Multiple Selection:</td><td><select name="mselect" multiple="multiple"><option>242</option><option>290-3</option><option>290-13</option></select></td></tr>

</table></form>

Forms and GET

Limited to string-oriented form values
- but HTML forms also allow file upload (this requires POST)
All values of all form input fields are collected
- for text and selection fields, this is one input field
- for checkboxes and radio buttons, this collects the selected fields
The browser composes a URI query string
- the form submission is a set of name/value pairs (names may appear more than once!)
- using URI query string notation, it is appended to the URI of the form's action
GET is good!
- URI-encoded queries can be bookmarked and otherwise reused (e.g., cached)
- if possible, use GET when implementing a form
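
What the browser does when it submits a GET form can be imitated with `urllib.parse.urlencode`. The sketch below assumes a few values entered into the example form; note that the name `check` appears twice because two checkboxes were selected.

```python
from urllib.parse import urlencode

def get_request_uri(action: str, fields) -> str:
    """Append the name/value pairs to the action URI as a query string."""
    return action + "?" + urlencode(fields)

# Assumed user input: a text value, two selected checkboxes, one selection.
fields = [("text", "text input"), ("check", "1"), ("check", "3"), ("select", "XML")]
uri = get_request_uri("http://stevex.net/dump.php", fields)
print(uri)
# http://stevex.net/dump.php?text=text+input&check=1&check=3&select=XML
```

Because the complete request is captured in the URI, it can be bookmarked, linked to, and cached like any other URI.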

HTTP POST

GET encodes the values in the URI
- for file uploads, this is not possible
- HTTP's POST request method can upload data
POST sends a request containing an entity
- the HTTP request then looks similar to a response (header fields and entity)
- the receiving process (the Web server) accepts the POST body
Entities can use any format (it is specified in a header field)
- just like e-mails, entities can have multiple parts
- the parts are separated using the standard MIME mechanism

Forms and POST

POST is used if the form specifies it
- it can (but should not) be used for non-file forms
- it should be used for file upload forms (otherwise, only the name is uploaded)
File upload forms must specify the appropriate encoding
- application/x-www-form-urlencoded is the default (values in the entity)
- multipart/form-data is required for file upload (multipart form data)
The server side must be prepared to receive POST requests
- it must parse the entity rather than the URI's query string
- form values can then be extracted from the entity
- some environments (e.g., PHP) allow GET and POST to be handled transparently

Processing of Form Data

Form data is always encoded
- as a query string when using GET
- in an encoded entity when using POST
- in a multipart entity when using POST with multipart/form-data
Parsing the form data should be done by existing software
- most Web-aware programming environments provide this functionality
- PHP allows access through different mechanisms

Query Parameter Test:
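<?php
// import_request_variables() was removed in PHP 5.4, so read the superglobals directly.
// POST takes precedence over GET, matching the original "gP" import order.
$form_test = $_POST['test'] ?? $_GET['test'] ?? '';
// Escape before echoing: never trust values supplied by the browser.
echo htmlspecialchars($form_test);
?>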
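
For comparison, the same decoding step in Python: `urllib.parse.parse_qs` handles both a GET query string and an `application/x-www-form-urlencoded` POST entity, since both use the same notation. The wrapper name `form_values` is an assumption for illustration.

```python
from urllib.parse import parse_qs

def form_values(encoded: str) -> dict:
    """Decode URL-encoded form data into a mapping from names to lists of values."""
    return parse_qs(encoded, keep_blank_values=True)

values = form_values("text=text+input&check=1&check=3")
print(values)   # {'text': ['text input'], 'check': ['1', '3']}
```

Every name maps to a list because names may appear more than once in a form submission.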

Structuring Forms

HTML forms are very loosely structured
- a form element somewhere in the page acts as the container
- inside the form there is an arbitrary mix of HTML markup and form controls
Visually, the structure often is (and should be) easy to see
- for non-visual access, more structure must be provided
- accessibility has become a major issue on the Web
Accessibility has many different facets
- voice browsers must be able to read aloud Web forms
- gateways should be able to intelligently re-structure Web forms

Labels

Without additional markup, a label text and its form control are not connected in HTML

<tr><td>Text:</td><td><input type="text" name="text"/></td></tr>
<tr><td>Password:</td><td><input type="password" name="password"/></td></tr>

The label element makes this connection explicit
- it connects a form control with the describing label
- this association is now accessible to clients for processing

<tr>
<td><label for="textctrl">Text:</label></td>
<td><input type="text" name="text" id="textctrl"/></td>
</tr>
<tr>
<td><label for="pwdctrl">Password:</label></td>
<td><input type="password" name="password" id="pwdctrl"/></td>
</tr>
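
Because the association is machine-readable, it can also be checked automatically. The following sketch lists controls that no label refers to; the class name is an assumption, and the check is simplified (it ignores labels that wrap their control and does not exempt buttons).

```python
from html.parser import HTMLParser

class LabelCheck(HTMLParser):
    """Record label targets and form controls to find controls without a label."""

    CONTROLS = {"input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.labelled_ids = set()
        self.controls = []   # (name, id) pairs in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label" and attrs.get("for"):
            self.labelled_ids.add(attrs["for"])
        elif tag in self.CONTROLS:
            self.controls.append((attrs.get("name"), attrs.get("id")))

    def unlabelled(self):
        """Return the names of all controls whose id is not the target of a label."""
        return [name for name, cid in self.controls if cid not in self.labelled_ids]

markup = """
<label for="textctrl">Text:</label> <input type="text" name="text" id="textctrl"/>
Password: <input type="password" name="password"/>
"""
check = LabelCheck()
check.feed(markup)
print(check.unlabelled())   # ['password']
```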

Fieldsets

Complex forms may need structuring
- groups of controls for subsets of the collected data
- this structure should be represented in markup

<fieldset><legend>Billing</legend>billing form controls...</fieldset>
<fieldset><legend>Shipping</legend>shipping form controls...</fieldset>

Excellent example of HTML's markup design philosophy
- if a client does not support fieldsets, the elements are ignored
- the title of the fieldset will be displayed, because it is text


Tabbing in Forms

Tabbing is a very convenient way of navigating a form
- after completing one field, users should be taken to the next
- the order should be defined by the form creator, not by accident
The tabindex attribute defines the tabbing order
- it contains a number which is interpreted relative to other numbers
- all form controls may carry a tabindex attribute
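
The resulting order can be sketched as a small sorting function. It assumes the HTML 4 rules: controls with a positive tabindex come first in increasing order (document order breaks ties), followed by controls with tabindex 0 or no tabindex in document order. The function name and the control names are hypothetical.

```python
def tab_order(controls):
    """controls: (name, tabindex) pairs in document order; tabindex may be None or 0."""
    positive = [(idx, pos, name) for pos, (name, idx) in enumerate(controls) if idx]
    rest = [name for name, idx in controls if not idx]
    # Sorting the tuples orders by tabindex first and by document position second.
    return [name for _, _, name in sorted(positive)] + rest

controls = [("name", 2), ("comment", None), ("email", 1), ("phone", 2)]
print(tab_order(controls))   # ['email', 'name', 'phone', 'comment']
```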

Disabled and Readonly Controls

In complex scenarios, certain controls may be disabled or readonly
- based on a workflow, some controls may not apply in all cases
- for the sake of a consistent view, they should still be included in the interface
Disabled controls are not used
- they cannot be tabbed to and never receive focus
- their value is not included in the form's submission data
Readonly controls cannot be changed
- they can be tabbed to and may receive focus
- their value may not be changed (important for radio buttons and checkboxes!)
- their value is included in the form's submission data
Important: Never trust the Browser!
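
Since readonly and disabled are only hints to the browser, the server has to re-check any value it considers fixed. A minimal sketch of such a check, assuming the server knows the expected values; the function and field names are hypothetical.

```python
def tampered_fields(submitted: dict, readonly_expected: dict) -> list:
    """Return the names of readonly fields whose submitted value differs from the server's."""
    return sorted(
        name
        for name, expected in readonly_expected.items()
        if submitted.get(name) != expected
    )

expected = {"customer_id": "42", "price": "19.90"}    # values the server rendered as readonly
submitted = {"customer_id": "42", "price": "0.01"}    # values that came back in the request
print(tampered_fields(submitted, expected))           # ['price']
```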

Disabled and Readonly Controls Display

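[Table of rendered form controls comparing the normal, disabled, and readonly variant of each control; the live controls are not reproduced here]
- the readonly variants of checkbox, radio button, and file upload controls are flagged with "!"
- the readonly variants of selection and multiple selection controls are marked "not supported"
- text areas are shown with "initial text" in all three variants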

Outline (HTML and XHTML)


HTML and XML

HTML traditionally was never parsed strictly
- browsers attempt to render a page at all costs
- HTML is more tag soup than a validated structured document
XML has been introduced for machine-readable structured content
- XML's syntax is almost (but not completely) like HTML syntax
- XML must be well-formed; parsing errors must be reported
XML-compliant HTML could be a step towards better HTML
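
The difference in error handling is easy to demonstrate with an XML parser. The sketch below uses the standard `xml.etree.ElementTree` module; the function name is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML, False if the parser reports an error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<p>one<br>two</p>"))    # False: br is never closed
print(is_well_formed("<p>one<br/>two</p>"))   # True: empty element in XML syntax
```

A browser would render both fragments identically, while an XML parser rejects the first one outright.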

Transforming HTML to XHTML

HTML in many cases is not valid
Transforming HTML to XHTML first requires clean HTML
- the clean-up process can be pretty expensive
- the actual conversion of clean HTML to XHTML is easy
Tools exist which try to automatically clean up HTML
- this can be hard because invalid HTML must be processed speculatively
- flexible tools allow a lot of configuration to work well with various inputs
- based on the quality of the HTML, configuration can be quick or tedious

Transforming XHTML to HTML

XHTML is almost HTML
- the syntax differences are minor (most importantly, empty elements)
- it has been designed to not cause problems in browsers
- there are very few cases where HTML syntax is really required
XHTML is always well-formed and very often valid
- machine-based transformation can start from a solid foundation
- this direction is much cheaper than Transforming HTML to XHTML

XHTML 1.0

A reformulation of HTML in XML
- changes nothing about the elements or attributes
- lower-case names and XML syntax are the most important issues
XHTML 1.0 is nothing more than a clean version of HTML 4.01
- XML conformance requires documents to be correctly structured
- XHTML 1.0 documents can be processed with standard XML tools
Replacing HTML with XHTML 1.0 can be a step to better HTML
- documents have to be valid HTML to be turned into XHTML 1.0
Using XHTML 1.0 can be a step to an XML-based workflow
- valid XHTML 1.0 documents can be turned into CMS-like XML
- XML can then serve as input for various presentation formats (XHTML, PDF)
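
Processing XHTML with standard XML tools can look like the following sketch, which extracts the title and language with `xml.etree.ElementTree`. The sample document is hypothetical; the only XHTML-specific detail is that element names live in the XHTML namespace.

```python
import xml.etree.ElementTree as ET

XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"   # expanded name of xml:lang

# Hypothetical XHTML 1.0 document used only for this example.
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">'
    "<head><title>HTML Forms</title></head>"
    "<body><p>content</p></body>"
    "</html>"
)

doc = ET.fromstring(xhtml)
title = doc.findtext("x:head/x:title", namespaces=XHTML_NS)
language = doc.get(XML_LANG)
print(title, language)   # HTML Forms en
```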

XHTML 1.1

Modularization of XHTML 1.0 Strict
- all deprecated elements and attributes are no longer allowed
Conformance is tightened to documents which do not extend the language
- documents are not allowed to use additional elements or attributes
- many documents will not fit into this tighter concept of conformance

XHTML 2.0

Completely new version of (X)HTML with no backwards compatibility
- the name is more marketing than reality
- XHTML 2.0 might become a server-side language for quality-conscious Web publishers
Many issues from previous HTML versions have been addressed
- HTML Forms are replaced by XForms for better forms on the Web
- HTML frames are replaced by XFrames
- nl for nested lists directly supports nested navigation lists
- any element can become a link by having an href attribute
- img alt is removed and the alt text becomes img content
- headings are represented by h and sections are properly nested

Hypertext Markup Language (HTML)

Information Systems and the World Wide Web

International School of New Media, University of Lübeck

Erik Wilde, UC Berkeley School of Information, 2007-01-03