Hypertext Markup Language (HTML)

Abstract

The Hypertext Markup Language (HTML) is the most important content type on the Web. Even though it is primarily intended for humans (by presenting formatted pages of textual content), it also has facets that are important for machine-based processing. HTML can be used in a variety of ways, and this lecture looks at some of the important rules that should be observed when creating HTML, for example how to use HTML markup in general and how to create accessible forms.

HTML and WYSIWYG

- Thinking of HTML as a page-description language is wrong
- HTML has been designed as a structure-description language
- Even a lot of Cascading Style Sheets (CSS) style information does not change this
  - different available fonts
  - fonts with the same family name but different metrics
  - different hyphenation algorithms
  - hyphenation setting defaults
  - hyphenation dictionaries
  - different size spaces
  - different line-breaking algorithms
  - different widow/orphan/keeptogether rules
- HTML authoring is not the same as typesetting documents

Layout Engine Usage (figure)

All-Purpose Elements

- HTML elements are supposed to convey structural semantics
  - lists are available in various flavors (ul, ol, dl)
  - various phrase markup elements are available (em, strong, dfn, code, samp, kbd, var, cite, abbr, acronym)
  - various levels of headings can be used (h1-h6)
- HTML content should represent structural information
  - not all content can be mapped to HTML elements
  - in many cases HTML elements are available (there are even diff elements)
- HTML also has all-purpose elements
  - these elements have no semantics and are just containers
  - span is used as an inline container
  - div is used as a block container
  - all-purpose elements should only be used if no HTML element is available

Retain Content Structures

- HTML should represent content structures
  - CSS can be used to tweak the formatting (if required)
- Rich content should be mapped to rich Web pages
- HTML is just one possible representation of a resource
  - the data model of resources should not be limited by HTML
  - richer representations may become available in the future
  - why are there so few Web pages with tel: URIs?
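
As a rough illustration of the points above, the following fragment sketches how structural elements, a tel: URI, and the all-purpose containers might be combined. The company text, class names, and the phone number (taken from the example style of the tel: URI specification) are placeholders, not part of the lecture.

```html
<!-- Hypothetical contact fragment: all names and numbers are placeholders. -->
<h2>Contact</h2>

<!-- Phrase markup carries meaning; the tel: URI makes the number machine-usable. -->
<p>
  Call our <abbr title="Customer Support">CS</abbr> team at
  <a href="tel:+1-201-555-0123">+1 201 555 0123</a>.
  Please have your <em>order number</em> ready.
</p>

<!-- A definition list for term/description pairs. -->
<dl>
  <dt><dfn>Order number</dfn></dt>
  <dd>The code printed on your invoice, for example <code>A-17</code>.</dd>
</dl>

<!-- All-purpose containers only where no semantic element fits. -->
<div class="promo">
  Today only: <span class="price">free shipping</span>
</div>
```

A client that knows nothing about the CSS classes still sees a heading, a paragraph with a dialable link, and a definition list, which is the kind of structure the slides argue for.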

HTTP Web Services

- Services can be provided through URI/HTTP
  - URI-based services need input as a query string
  - the question is how the user gets information into the URI
- HTML forms provide an interface for assembling query strings
  - users fill out a form providing several fields
  - the browser submits the entered information by HTTP to a URI
  - the result of the request is displayed to the user
- HTTP has two different methods for submitting data
  - GET encodes the data as a URI query string
  - POST encodes the data as an HTTP request entity

Forms Mechanics

- HTML forms are normal Web pages (using form elements)
- The process receiving the form data produces a result page

Forms Markup

- All form controls must be inside a form element
  - it specifies the URI for submitting the form values (action="URI")
  - it specifies the method for submitting the form values (method="GET|POST")
  - for POST forms, the encoding may be selected
- form contains regular HTML markup and form elements
  - the regular HTML markup creates the form's layout (table, list, texts)
  - the form elements create the controls for acquiring input data
- Each form should have a submit button
  - when pressing this button, the form values are sent to the action URI
  - without such a button, users have no reliable way to submit the form values

Forms Elements (User View)

- HTML provides a small set of form controls
- Sufficient for many applications

Forms Elements (Source View)

<form action="http://stevex.net/dump.php" method="POST" enctype="multipart/form-data"><table>
<tr><td>Text:</td><td><input type="text" name="text" value="text input"/></td></tr>
<tr><td>Password:</td><td><input type="password" name="password" value="hidden text"/></td></tr>
<tr><td>Checkbox:</td><td><input type="checkbox" name="check" value="1"/> <input type="checkbox" name="check" value="2"/> <input type="checkbox" name="check" value="3"/></td></tr>
<tr><td>Radio Button:</td><td><input type="radio" name="radio" value="1"/> <input type="radio" name="radio" value="2"/> <input type="radio" name="radio" value="3"/></td></tr>
<tr><td>Submit:</td><td><input name="submit" type="submit"/></td></tr>
<tr><td>File Upload:</td><td><input name="file" type="file"/></td></tr>
<tr><td>Text Areas:</td><td><textarea name="textarea" rows="2" cols="20"></textarea></td></tr>
<tr><td>Selection:</td><td><select name="select"><option selected="selected">XML</option><option>SGML</option></select></td></tr>
<tr><td>Multiple Selection:</td><td><select name="mselect" multiple="multiple"><option>242</option><option>290-3</option><option>290-13</option></select></td></tr>
</table></form>

Forms and GET

- Limited to string-oriented form values
  - but HTML forms also allow file upload (this requires POST)
- All values of all form input fields are collected
  - for text and selection fields, this is one input field
  - for checkboxes and radio buttons, this collects the selected fields
- The browser composes a URI query string
  - the form submission is a set of name/value pairs (names may appear more than once!)
  - using URI query string notation, it is appended to the URI of the form's action
- GET is good!
  - URI-encoded queries can be bookmarked and otherwise reused (e.g., cached)
  - if possible, use GET when implementing a form

Forms and POST

- GET encodes the values in the URI
  - for file uploads, this is not possible
  - HTTP's POST request method can upload data
- POST sends a request containing an entity
  - the HTTP request then looks similar to a response (header fields and entity)
  - the receiving process (the Web server) accepts the POST body
- Entities can use any format (it is specified in a header field)
  - just like e-mails, entities can have multiple parts
  - the parts are separated using the standard MIME mechanism

POST Form Processing

- POST is used if the form specifies it
  - it can (but, for plain queries without side effects, should not) be used for non-file forms
  - it should be used for file upload forms (otherwise, only the name is uploaded)
- File upload forms must specify the appropriate encoding
  - application/x-www-form-urlencoded is the default (values in the entity)
  - multipart/form-data is required for file upload (multipart form data)
- The server side must be prepared to receive POST requests
  - it must parse the entity rather than the URI's query string
  - form values can then be extracted from the entity
  - some environments (e.g., PHP) allow GET and POST to be handled transparently

Structuring Forms

- HTML forms are very loosely structured
  - a form element somewhere represents the container
  - inside the form there is an arbitrary collection of HTML and form inputs
- Visually, the structure often is (and should be) easy to see
  - for non-visual access, more structure must be provided
  - accessibility has become a major issue on the Web
- Accessibility has many different facets
  - voice browsers must be able to read Web forms aloud
  - gateways should be able to intelligently re-structure Web forms

Labels

- Label and form control are not connected by HTML

<tr><td>Text:</td><td><input type="text" name="text"/></td></tr>
<tr><td>Password:</td><td><input type="password" name="password"/></td></tr>

- The label element makes it possible to establish this connection
  - it connects a form control with the describing label
  - this association is now accessible to clients for processing

<tr>
<td><label for="textctrl">Text:</label></td>
<td><input type="text" name="text" id="textctrl"/></td>
</tr>
<tr>
<td><label for="pwdctrl">Password:</label></td>
<td><input type="password" name="password" id="pwdctrl"/></td>
</tr>

Fieldsets

- Complex forms may need structuring
  - groups of controls for subsets of the collected data
  - this structure should be represented in markup

<fieldset><legend>Billing</legend>billing form controls …</fieldset>
<fieldset><legend>Shipping</legend> shipping form controls … </fieldset>

- Excellent example of HTML's markup design philosophy
  - if a client does not support fieldsets, the elements are ignored
  - the title of the fieldset will still be displayed, because it is text

Rendered example: Billing (billing form controls …), Shipping (shipping form controls …)
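
The forms material above (GET versus POST, encodings, labels, and fieldsets) can be pulled together in a small sketch. The example.org URIs, field names, and default values are assumptions made for illustration; they do not come from the lecture.

```html
<!-- Hypothetical search form: a plain query without side effects, so GET is used. -->
<form action="http://example.org/search" method="get">
  <fieldset>
    <legend>Query</legend>
    <label for="q">Search terms:</label>
    <input type="text" name="q" id="q" value="html forms"/>
  </fieldset>
  <fieldset>
    <legend>Topics</legend>
    <!-- Same name on both checkboxes: the name may appear twice in the query string. -->
    <input type="checkbox" name="topic" value="forms" id="t1" checked="checked"/>
    <label for="t1">Forms</label>
    <input type="checkbox" name="topic" value="css" id="t2" checked="checked"/>
    <label for="t2">CSS</label>
  </fieldset>
  <p><input type="submit" value="Search"/></p>
</form>
<!-- Submitting with the default values requests:
     http://example.org/search?q=html+forms&topic=forms&topic=css
     (the submit button has no name, so it contributes no pair) -->

<!-- Hypothetical upload form: file content needs POST and multipart/form-data. -->
<form action="http://example.org/upload" method="post" enctype="multipart/form-data">
  <p>
    <label for="doc">File:</label>
    <input type="file" name="doc" id="doc"/>
  </p>
  <p><input type="submit" value="Upload"/></p>
</form>
```

The first request can be bookmarked or cached because everything is in the URI; the second cannot, since the file travels in the request entity.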

Tabbing in Forms

- Tabbing is a very convenient way of navigating a form
  - after completing one field, users should be taken to the next
  - the order should be defined by the form creator, not by accident
- The tabindex attribute defines the tabbing order
  - it contains a number which is interpreted relative to other numbers
  - all form controls may carry a tabindex attribute

Demo (tabindex 1-9): a series of controls labeled, in document order, 1, 7, 3, 6, 8, 2, 1, 4, 5, 9
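
A minimal sketch of an explicit tabbing order follows, assuming a hypothetical registration form (the URI and field names are invented for illustration). The visual order differs from the intended order, and tabindex restores the latter.

```html
<!-- Hypothetical form: document order is first, newsletter, last, submit. -->
<form action="http://example.org/register" method="post">
  <p>
    <label for="first">First name:</label>
    <input type="text" name="first" id="first" tabindex="1"/>
  </p>
  <p>
    <!-- Visited last, although it appears second in the source. -->
    <label for="news">Newsletter:</label>
    <input type="checkbox" name="news" id="news" value="yes" tabindex="4"/>
  </p>
  <p>
    <label for="last">Last name:</label>
    <input type="text" name="last" id="last" tabindex="2"/>
  </p>
  <p><input type="submit" value="Register" tabindex="3"/></p>
</form>
<!-- Tab order: first name, last name, submit button, newsletter checkbox. -->
```

Controls with positive tabindex values are visited in ascending order; controls that share a value are visited in the order they appear in the document.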

Disabled and Readonly Controls

- In complex scenarios, certain controls may be disabled or readonly
  - based on a workflow, some controls may not apply in all cases
  - for the sake of a consistent view, they should still be included in the interface
- Disabled controls are not used
  - they cannot be tabbed to and never receive focus
  - their value is not included in the form's submission data
- Readonly controls cannot be changed
  - they can be tabbed to and may receive focus
  - their value may not be changed (important for radio buttons and checkboxes!)
  - their value is included in the form's submission data
- Important: Never trust the browser! Both attributes are enforced only by the client, so the server must still validate what it receives

Disabled and Readonly Controls Display (figure)

XHTML and HTML

Why XHTML?

- XHTML is XML and can be used with XML tools
- HTML is not XML and cannot be used with XML tools
- HTML very often is not validated
  - tag soup that is hard to work with (text-based processing)
  - much harder to process as a machine-readable resource
- Browsers behave differently for HTML and XHTML
  - backwards compatibility mode often is triggered by HTML
  - XHTML often triggers (more) standards-compliant behavior

HTML → XHTML

- Element and attribute names are lowercase
- End tags are required
- Empty elements should use the empty-element tag syntax (with a space before the slash)
- Attribute values must always be quoted

HTML:

<P>Paragraphs often are not closed properly.
<IMG WIDTH=200 HEIGHT=300 SRC="test.png">
<P>And images by definition are always empty elements.

XHTML:

<p>Paragraphs often are not closed properly.</p>
<img width="200" height="300" src="test.png" />
<p>And images by definition are always empty elements.</p>

HTML Matters

- HTML is not just about getting text displayed
- Good HTML allows better browsing
- First represent as much as possible in HTML
- Then add what is missing as CSS and/or microformats
- Graceful degradation is important
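
Returning to the disabled and readonly controls discussed earlier, their effect on the submitted data can be sketched as follows, written in XHTML syntax. The order form, its URI, and its field names are hypothetical.

```html
<!-- Hypothetical order form showing readonly versus disabled. -->
<form action="http://example.org/order" method="post">
  <p>
    <!-- Readonly: focusable, not editable in the browser, and submitted. -->
    <label for="id">Order number:</label>
    <input type="text" name="id" id="id" value="A-17" readonly="readonly"/>
  </p>
  <p>
    <!-- Disabled: skipped when tabbing and left out of the submission. -->
    <label for="gift">Gift message:</label>
    <input type="text" name="gift" id="gift" value="" disabled="disabled"/>
  </p>
  <p>
    <label for="qty">Quantity:</label>
    <input type="text" name="qty" id="qty" value="1"/>
  </p>
  <p><input type="submit" value="Order"/></p>
</form>
<!-- Request entity with the default values: id=A-17&qty=1 -->
```

Because anyone can craft a request by hand, the receiving process should not assume that id still holds the value the page was served with.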