Hypertext Markup Language (HTML)
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License.
Abstract
The Hypertext Markup Language (HTML) is the language for providing Web content. It is based on the idea of structuring content, while the layout should be controlled using stylesheet languages. HTML document have a document head which allows HTML documents to contain document metadata. One of the most important tasks of HTML is to serve as an interface for data input for Web-based applications. This can be done by using HTML Forms .
Creating HTML
HTML is a standardized languageCSS and scripting turn HTML into a complex environment exhaustive testing is very hard (browser × version × OS × language × screen size × ...) Web design is only one part of Web publishing Design questions should be addressed as one part of a Web publishing HTML is a format for structured content, not a page description language Cascading Stylesheets (CSS) provide a separation between content and formatting
Web Design Today
HTML Alternatives
Macromedia Flash Flash is not a Web technology Flash applications are stand-alone application executed by the browser they break most principles which are important for Web publishing they are useful for stand-alone applications (information kiosks) they are acceptable as long as they do not have state Portable Document Format (PDF) PDF is not a Web technology PDF is optimized for representing paginated documents they break most principles which are important for Web publishing set up a publishing pipeline which provides PDF as a resource variant
Characters and Computers
American Standard Code for Information Interchange (ASCII) for the first time a basic set of characters had a universally accepted encoding many Internet Protocols encode their information in ASCII commands ASCII is a very limited repertoire of charactersbasic ASCII contains 128 characters (7 bit) with a number of control chars no variants of characters (german umlauts, french accents) are supported various code pages extending ASCII to 8 bit exist and are hard to distinguish Character is not a trivial concept when regarded globallyeuropean languages all have writing systems based on a small number of atoms other languages and writing systems have vastly different ideas of language atoms
Characters
Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape [...]
The Unicode Standard, Version 4.0 , Addison-Wesley, 2003
The alphabetic approach is only one of several possibilitiesA character in Japanese hiragana and katakana scripts corresponds to a syllable (usually a combination of consonant plus vowel) Korean Hangul combines symbols for individual sounds of the language into square blocks, each of which represents a syllable; depending on the user and the application, either the individual symbols or the syllabic clusters can be considered to be charactersIn Indic scripts each consonant letter carries an inherent vowel that is eliminated or replaced using semi-regular or irregular ways to combine consonants and vowels into clusters; depending on the user and the application, either individual consonants or vowels, or the consonant or consonant-vowel clusters can be perceived as characters Arabic and Hebrew vowel sounds are typically not written at all; when they are written they are indicated by the use of combining marks placed above and below the consonantal letters
Glyphs
[A Glyph is] a recognizable abstract graphic symbol which is independent of a specific design.
ISO/IEC 9541:1991, Information Technology – Font Information Interchange
Visual rendering introduces the notion of a glyph.There is not a one-to-one correspondence between characters and glyphsA single character can be represented by multiple glyphs (each glyph is then part of the representation of that character); these glyphs may be physically separated from one another A single glyph may represent a sequence of characters (this is the case with ligatures, among others) A character may be rendered with very different glyphs depending on the context A single glyph may represent different characters (e.g. capital Latin A, capital Greek A and capital Cyrillic A)
Unicode
Generalization takes characters beyond one and even two bytesUnicode recently added its 100'000th character a more structured approach is required Unicode cleanly separates various conceptual stepscharacters are collected and are then part of the character repertoire characters are then identified by a unique code point (U+0041
) a Character Encoding Scheme (CES) then maps the Coded Character Set (CCS) based on a Character Encoding Form (CEF) XML is ASCII for the 21st century purists sometimes consider Unicode too big or dangerous Unicode is well-established and is necessary in a globalized economy
Unicode Character Count
Unicode has the ability to encode 1'114'112 charactersthis means that currently ~10% of the available space is used Characters are organized into 17 planes of 216 = 65'536 characters Planes are numbered from 0 to 16 Plane 0 is the Basic Multilingual Plane (BMP) all characters which are used today are part of the BMP Planes beyond the BMP contain rare and historic charactersOld Italic , Deseret , Byzantine Musical Symbols most space within these astral planes is empty
Unicode Encodings
A א 好
Code point U+0041 U+05D0 U+597D U+233B4
UTF-8 41 D7 90 E5 A5 BD F0 A3 8E B4
UTF-16 00 41 05 D0 59 7D D8 4C DF B4
UTF-32 00 00 00 41 00 00 05 D0 00 00 59 7D 00 02 33 B4
UTF-8
UTF-8 is one of the two standardized encodings for XMLevery ASCII document by definition is a UTF-8 document UTF-8 must be supported by every XML implementation UTF-8 is not trivial, but it is widely supported and easy to implementthere is no 1:1 correspondence between bytes and characters each Unicode character is encoded by 1-6 bytes UTF-8 is good for europeans (1 byte per ASCII character) Non-Unicode documents must be transcoded to UTF-8keeping track of resource character encodings is a good idea
Other UTFs
UTF-16 stores every character as 2 or more bytesBMP characters are stored as 2 bytes astral plane characters are stored as 4 bytes UTF-16 is the other encoding (in addition to UTF-8) required by XML UTF-32 stores every character as 4 bytesvery simple and very inefficient (ASCII text volume increases by 300%) Multi-byte formats introduce the problem of byte order UTF-16/32BE and UTF-16/32LE are stored with guaranteed endian UTF-16/32 may use a Byte Order Mark (BOM) (U+FEFF) to detect the endian
Character Set Identification
HTTP, XML, and HTML support character set identificationHTTP supports the Content-Type
header field Content-Type: text/html; charset=utf-8 XML encodes the character set in the XML declaration<?xml version="1.0" encoding="utf-8"?> HTML supports the meta element in the document's head <meta http-equiv="Content-Type" content="text/html;charset=utf-8"> Different identifications serve different purposes Conflicting identifications are a sign of management problems
HTML Document Structure
HTML Metadata
HTML documents contain content and metadatathe only mandatory metadata is the Document Title more metadata improves the way how documents can be handled Document content is what is rendered in the browser windowthe visual part of a Web page there are very little semantics associated with document contents Document metadata is what helps clients to manage documentsadditional information (which may be ignored if not supported) metadata usually have well-defined semantics
Document Title
The only mandatory part of HTML document metadata Used for various aspects in browsersthe window title for document windows the name for bookmarks and other document names Used in other places where documents should have namesresult lists of search engines generated navigation links (site maps) Document titles should be short and context-freeshort because they often will be used in list-like environmentscontext-free because they often will be used outside of the page's context
Metadata
Document metadata is specified with the meta elementname
identifies a metadata property (there is no standardized list)content
specifies the property value<meta name="Author" content="Erik Wilde">
<meta name="copyright" content="© 2007 dret.net"> Client-side support for metadata varies and is hard to predictspamming targets can be safely omitted (keywords
)if information can be generated automatically it should be included meta can also simulate the presence of HTTP headers<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="Refresh" content="10; URL=http://whatever.com/">
<meta http-equiv="PICS-Label" content="(PICS-1.1 ... )">
Links
Metadata links specify connections to other resourcesthey simply are a special form of metadata rel
identifies the relation to the linked resourcehref
specifies the URI of the linked resourceexotic attributes are hreflang
/charset
(resource language and character set) and rev
(reverse rel
) Links can be very useful but may easily breakthey can be easily kept up-to-date in a controlled environment if links can be generated automatically they should be included apart from links to External Stylesheets , metadata links are used rarely
Document Relationships
Web sites often have site map informationsite maps and other document relationships are useful navigation aids most browsers force Web designers to implement their own navigation Document relationships can be specified as document metadatanavigation information is accessible as machine-readable information alternate, stylesheet, start, next, prev, contents, index, glossary, copyright, chapter, section, subsection, appendix, help Usability and accessibility can be improved with document relationshipsmost browsers ignore this information if relationships can be generated automatically they should be included
HTTP Language Specification Issues
HTTP allows the specification of the Content-Language
HTTP headers are transientthey are not saved when the document is saved they might not be cached when the document is cached Setting HTTP headers requires server configurationthe server must be able to detect the language of a resource the server must be configured to set the appropriate HTTP header Web sites should provide HTTP and HTML language identificationeasy to set up correctly in a controlled publishing environment
HTTP Web Services
Services can be provided through URI/HTTPURI-based services need input as a query string the question is how the user gets information into the URI HTML forms provide an interface for assembling query stringsusers fill out a form providing several fields the browser submits the entered information by HTTP to a URI the result of the request is displayed to the user HTTP has to different methods for submitting dataGET
encodes the data as a URI query stringPOST
encodes the data as HTTP request entity
Forms Mechanics
HTML forms are normal Web pages (using form elements) The process receiving the form data produces a result page
Forms Markup
All form elements must be inside a form elementspecifies the URI for submitting the form values (action="URI" ) specifies the method for submitting the form values (method="GET |POST " ) for POST
forms, the encoding may be selected form contains regular HTML markup and form elementsthe regular HTML markup creates the form's layout (table, list, texts) the form elements create the controls for acquiring input data Each form should have a submit button when pressing this button, the form values are sent to the action URI without such a button, the form values cannot be submitted
Forms Elements (User View)
HTML provides a small set of form controls Sufficient for a many applications
Forms Elements (Source View)
<form action="http://stevex.net/dump.php" method="POST" enctype="multipart/form-data"><table>
<tr><td>Text:</td><td><input type="text" name="text" value="text input"/></td></tr>
<tr><td>Password:</td><td><input type="password" name="password" value="hidden text"/></td></tr>
<tr><td>Checkbox:</td><td><input type="checkbox" name="check" value="1"/> <input type="checkbox" name="check" value="2"/> <input type="checkbox" name="check" value="3"/></td></tr>
<tr><td>Radio Button:</td><td><input type="radio" name="radio" value="1"/> <input type="radio" name="radio" value="2"/> <input type="radio" name="radio" value="3"/></td></tr>
<tr><td>Submit:</td><td><input name="submit" type="submit"/></td></tr>
<tr><td>File Upload:</td><td><input name="file" type="file"/></td></tr>
<tr><td>Text Areas:</td><td><textarea name="textarea" rows="2" cols="20"/></td></tr>
<tr><td>Selection:</td><td><select name="select"><option selected="selected">XML</option><option>SGML</option></select></td></tr>
<tr><td>Multiple Selection:</td><td><select name="mselect" multiple="multiple"><option>242</option><option>290-3</option><option>290-13</option></select>
</table></form>
Forms and GET
Limited to string-oriented form valuesbut HTML forms also allow file upload (this requires POST
) All values of all form input fields are collectedfor text and selection fields, this is one input field for checkboxes and radio buttons, this collects the selected fields The browser composes a URI query stringthe form submission is a set of name/value pairs (names may appear more than once!) using URI query string notation, it is appended to the URI of the form's action GET
is good!URI-encoded queries can be bookmarked and otherwise reused (e.g., cached) if possible, use GET
when implementing a form
HTTP POST
GET
encodes the values in the URIfor file uploads, this is not possible HTTP's POST
request method can upload data POST
sends a request containing an entitythe HTTP request then looks similar to a response (header fields and entity) the receiving process (the Web server) accepts the POST body Entities can use any format (it is specified in a header field)just like e-mails, entities can have multiple parts the parts are separated using the standard MIME mechanism
Forms and POST
POST
is used if the form specifies itit can (but should not) be used for non-file forms it should be used for file upload forms (otherwise, only the name is uploaded) File upload forms must specify the appropriate encodingapplication/x-www-form-urlencoded
is the default (values in the entity)multipart/form-data
is required for file upload (multipart form data) The server side must be prepared to receive POST
requestsit must parse the entity rather than the URI's query string form values can then be extracted from the entity some environments (e.g., PHP ) allow to handle GET
/POST
transparently
Structuring Forms
HTML forms are very loosely structuredform somewhere representing the containerinside the form a random collection of HTML and form inputs Visually, the structure often is (and should be) easy to seefor non-visual access, more structure must be provided accessibility has become a major issue on the Web Accessibility has many different facetsvoice browsers must be able to read aloud Web forms gateways should be able to intelligently re-structure Web forms
Labels
Label and form control are not connected by HTML
<tr><td>Text:</td><td><input type="text" name="text"/></td></tr>
<tr><td>Password:</td><td><input type="password" name="password"/></td></tr>
The label element allows to make this connectionit connects a form control with the describing label this association is now accessible to clients for processing
<tr>
<td><label for="textctrl">Text:</label></td>
<td><input type="text" name="text" id="textctrl"/></td>
</tr>
<tr>
<td><label for="pwdctrl">Password:</label></td>
<td><input type="password" name="password" id="pwdctrl"/></td>
</tr>
Fieldsets
Complex forms may need structuringgroups of controls for subsets of the collected data this structure should be represented in markup
<fieldset><legend>Billing</legend>billing form controls...</fieldset>
<fieldset><legend>Shipping</legend>shipping form controls...</fieldset>
Excellent example for HTML's markup design philosophyif a client does not support fieldsets, the elements are ignored the title of the fieldset will be displayed, because it is text
Billing billing form controls...
Shipping shipping form controls...
Tabbing in Forms
Tabbing is a very convenient way of navigating a formafter completing one field, users should be taken to the next the order should be defined by the form creator, not by accident the tabindex attribute defines the tabbing orderit contains a number which is interpreted relative to other numbers all form controls may carry a tabindex attribute tabindex 1-9:
1
7
3
6
8
2
1
4
5
9
Disabled and Readonly Controls
In complex scenarios, certain controls may be disabled or readonlybased on a workflow, some controls may not apply in all cases for the sake of a consistent view, they should still be included in the interface Disabled controls are not usedthey cannot be tabbed to and never receive focus their value is not included in the form's submission data Readonly controls cannot be changedthey can be tabbed to and may receive focus their value may not be changed (important for radio buttons and checkboxes!) their value is included in the form's submission data Important: Never trust the Browser!
Disabled and Readonly Controls Display
HTML and XML
HTML traditionally was never parsedbrowsers attempt to render a page at all costs HTML is more tag soup than a validated structured document XML has been introduced for machine-readable structured contentXML's syntax is almost (but not completely) like HTML syntax XML must be well-formed, parsing errors must be reported XML-compliant HTML could be a step towards better HTML
Transforming HTML to XHTML
HTML in many cases is not valid Transforming HTML to XHTML first requires clean HTMLthe clean-up process can be pretty expensive the actual conversion of clean HTML to XHTML is easy Tools exist which try to automatically clean up HTMLthis can be hard because invalid HTML must be processed speculatively flexible tools allow a lot of configuration to work well with various inputs based on the quality of the HTML, configuration can be quick or tedious
Transforming XHTML to HTML
XHTML is almost HTMLthe syntax differences are minor (most importantly, empty elements) it has been designed to not cause problems in browsers there are very few cases where HTML syntax is really required XHTML is always well-formed and very often valid
XHTML 1.0
A reformulation of HTML in XML changes nothing about the elements or attributes lower-case names and XML syntax are the most important issues XHTML 1.0 is nothing more than a clean version of HTML 4.01XML conformance requires documents to be correctly structured XHTML 1.0 documents can be processed with standard XML tools Replacing HTML with XHTML 1.0 can be a step to better HTMLdocuments have to be valid HTML to be turned into XHTML 1.0 Using XHTML 1.0 can be a step to an XML-based workflowvalid XHTML 1.0 documents can be turned into CMS-like XML XML can then serve as input for various presentation formats (XHTML, PDF)
XHTML 1.1
Modularization of XHTML 1.0 Strict all deprecated elements and attributes are no longer allowed Conformance is tightened to documents which do not extend the languagedocuments are not allowed to use additional element or attributes many documents will not fit into this tighter concept of conformance
XHTML 2.0
Completely new version of (X)HTML with no backwards compatibilitythe name is more marketing than reality XHTML 2.0 might become a server-side language for quality-conscious Web publishers Many issues from previous HTML versions have been addressedHTML Forms are replaced by XForms for better forms on the WebHTML frames are replaced by XFrames nl for nested lists directly supports nested navigation listsany element can become a link by having an href
attribute img alt is removed and the alt
text becomes img contentheadings are represented by h and section s are properly nested
HTML as Information
HTML is a language for structured information HTML documents are part of an information system Web design is the very last step of the publishing pipeline Being a good citizen on the Web is not very hard