Hypertext Transfer Protocol (HTTP)
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License.
Abstract
The Hypertext Transfer Protocol (HTTP) is the protocol that transports Web content. It is based on the Internet's reliable transport protocol, the Transmission Control Protocol (TCP). HTTP is more than a simple file retrieval protocol: it provides content negotiation and other features that turn the Web into a flexible and scalable information system. The other basic infrastructure building block of the Web is the Domain Name System (DNS), which provides a naming system on top of the number-oriented addressing scheme of the Internet protocols.
Network History
- First regarded as a convenient workaround for floppy disks
- "real computer scientists write compilers"
- the value of computer networks depends on their size
- Early networking solutions were vendor-specific islands
- DECnet for Digital Equipment Corporation (DEC) customers
- XNS for Xerox customers
- SNA for IBM customers
- transmitting data between these networks was very cumbersome
- Bridging networks transparently became increasingly important
- more computers and networks increase the benefit of interconnections
- layering being used for internetworks, not only for networks
Networks vs. Internetworks
- Specific networks use specific abstractions
- how to address nodes
- how to address applications on these nodes
- how to transmit data to these applications
- Internetworks provide a network-independent abstraction
- nodes are addressed uniformly (IP addresses)
- applications are identified uniformly (ports)
- data transmission uses one set of protocols (TCP/UDP)
Internet vs. ISO/OSI
- Global network emerges by the end of the 80's
- some kind of internetworking protocols were required
- ARPANET had been running since the late 60's (1965: Berkeley-MIT)
- ISO/OSI (Open Systems Interconnection) was a new specification
- the idea was to build something new
- OSI was specified rather than developed and tested
- For some time, it was unclear what the "global internetwork" would be based on
- Internet protocols were already established and running
- OSI promised a fresh start with "bigger is better" protocols
Internet
- Very early start and a lot of experience
- pragmatic and evolutionary approach
- "if it's not broken, don't fix it"
- Standardization by independent technical experts
- avoids the "designed by committee" effect of consortiums
- conservative and concentrating on stability
- implementations are required to prove technical feasibility
- simplicity whenever possible
ISO/OSI (Open Systems Interconnection)
- Designed and specified in the 80's
- a "start-from-scratch" approach to an internetworking protocol suite
- many research contributions, many research publications
- great for teaching principles of communications systems layers
- Problems with complexity and interoperability
- some OSI protocols were never fully implemented
- because of many optional features, interoperability was a problem
- Some of OSI's ideas had to be retrofitted to the Internet
- session management (Layer 5) now is HTTP cookies
- data representation (Layer 6) now is XML
OSI vs. Internet
Internet Protocols
IP Features
- End-to-end data transfer (IP addresses)
- Hiding lower-level heterogeneity
- Connection-less (each packet routed individually)
- Unreliable (packets may be lost or duplicated)
IP Addresses
- IP identifies hosts by an IP address
- IP addresses are globally unique (and can be geocoded)
- IP uses 4 bytes for addresses (e.g., 128.32.226.29)
- maximum number of addresses: 2^32 (about 4.3 billion)
- IPv6 extends the address format to 16 bytes (2^128 addresses)
- IP addresses are well-organized
- important for routing (i.e., sending packets to the target host)
- not ideally suited for mobile or ad-hoc networks
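The address arithmetic above can be checked with a few lines of Python. This is only a small sketch using the standard `ipaddress` module; the variable names are chosen for illustration.

```python
import ipaddress

# 128.32.226.29 is the example address used in the list above.
addr = ipaddress.ip_address("128.32.226.29")

# An IPv4 address is a 32-bit number, conventionally written as four decimal bytes.
as_bytes = addr.packed      # the 4 raw bytes
as_int = int(addr)          # the same value as a single integer

ipv4_space = 2 ** 32        # 4,294,967,296 possible IPv4 addresses
ipv6_space = 2 ** 128       # size of the 16-byte IPv6 address space

print(len(as_bytes), as_int, ipv4_space, ipv6_space)
```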
IP Address Classes
TCP Features
- Flow-controlled (avoiding congestion)
- Reliable (no data lost or duplicated)
- Connection-oriented
- Application addressing
Reliable Connections
- IP may drop or duplicate packets
- TCP adds sequence numbers to data packets
- if problems are detected, TCP recovers automatically
- TCP avoids network congestion and system overload
- slow start avoids flooding receivers with data they cannot process
- fast retransmit for avoiding timeouts when losing data
- a sliding window for controlling the amount of outstanding packets
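How sequence numbers let a receiver repair duplication and reordering can be sketched as follows. This is a deliberately simplified model, not TCP itself: it assumes sequence numbers count bytes starting at 0 and that segments never partially overlap. The function name `reassemble` is made up for this illustration.

```python
def reassemble(segments):
    """Rebuild a byte stream from (sequence_number, payload) pairs.

    Simplified model: sequence numbers count bytes from 0 and
    segments do not partially overlap.
    """
    buffered = {}
    for seq, payload in segments:
        if payload:
            # A segment whose sequence number was already seen is a duplicate.
            buffered.setdefault(seq, payload)

    stream = b""
    next_seq = 0
    # Deliver data in order and stop at the first gap.
    while next_seq in buffered:
        payload = buffered[next_seq]
        stream += payload
        next_seq += len(payload)
    # next_seq is the position the receiver would acknowledge.
    return stream, next_seq


# Out of order, one duplicate, and the segment starting at byte 10 is missing.
received = [(5, b" worl"), (0, b"hello"), (5, b" worl"), (15, b"!")]
stream, ack = reassemble(received)
print(stream, ack)
```

The receiver can only hand over data up to the gap; the acknowledgment position tells the sender which data still has to be retransmitted.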
TCP Window
Naming vs. Addressing
- IP addresses depend on network topology and organization
- reorganizing a network may change all IP addresses
- identifying important hosts should not be address-based
- Names are supposed to be more stable than addresses
- a name is an abstract identification of something
- names can be used to obtain more information
- Network services should use names instead of addresses
- before using the service, a mapping has to be performed
- the Domain Name System (DNS) provides this service
DNS Properties
- DNS has a bootstrap problem
- DNS provides a service and should thus be identified by a name
- for resolving names into addresses, the DNS service is required
- DNS configuration is part of basic Internet configuration
- DHCP provides IP address, netmask, gateway, and DNS server address
- DNS names are hierarchically structured
- in ischool.berkeley.edu, edu is the Top-Level Domain (TLD)
- TLDs are either generic (gTLD) or country code (ccTLD)
- subdomains are federated (e.g., edu, us, uk, tv)
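The hierarchy in a DNS name is read from right to left. A minimal sketch that lists the domains a name belongs to, starting at the TLD (the helper name `dns_hierarchy` is an assumption for this example):

```python
def dns_hierarchy(name):
    """Return the domains a DNS name belongs to, from the TLD down to the full name."""
    labels = name.rstrip(".").lower().split(".")
    # Walk from the rightmost label (the TLD) towards the host.
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]


zones = dns_hierarchy("ischool.berkeley.edu")
print(zones)
```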
Names Matter
- Names are not unique and namespaces are finite
- name disputes arise which were irrelevant before the Web
- "cybersquatting" as a popular way to make money
- Names can be worth a lot of money
- business.com was sold for $7.5 million
- Name inflation can be used to generate money
- aero, biz, coop, info, jobs, mobi, museum, name, pro, travel
- Names can have political relevance
- ccTLDs are assigned based on the UNO's idea of what a country is
- Names can have symbolic relevance
- Catalonia managed to get a domain of its own (cat)
Domain Name Space
Using DNS
- DNS is used by virtually all Internet applications
- names are more stable than addresses
- E-mail has some dedicated features built into DNS
- special entries (MX records) identify the e-mail server for a domain
- fallback entries help dealing with failing e-mail servers
- most URIs are based on DNS names
- http://ischool.berkeley.edu/ identifies the access protocol and the host
- the browser first performs a DNS lookup
- a TCP connection is then established to the address returned by the DNS
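The steps a browser takes before it can send a request can be sketched with the Python standard library. This is an illustration only: `connection_target` and `resolve` are names invented here, and `resolve` needs network access, so it is defined but not called.

```python
import socket
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def connection_target(uri):
    """Extract the DNS name and TCP port a client has to connect to."""
    parts = urlsplit(uri)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port


def resolve(host, port):
    """Ask the operating system's resolver for addresses (needs network access)."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry ends with a socket address whose first element is the IP address.
    return [info[4][0] for info in infos]


host, port = connection_target("http://ischool.berkeley.edu/")
print(host, port)
```

Calling `resolve(host, port)` would return the addresses to which the TCP connection can then be opened.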
DNS Request Processing
User Datagram Protocol (UDP)
- Transport protocol (based on IP, just like TCP)
- very thin protocol, adds few features to IP
- provides application addressing
- UDP is unreliable and connection-less
- ideal for fast streaming media (delay is critical, lost packets are tolerable)
- acceptable for one-packet applications (lightweight and fast)
- not acceptable for reliable data transfer
The Web's Protocol
DNS & HTTP
The two basic protocols which every Web browser must implement are DNS access and HTTP. However, most operating systems provide an API for DNS access, so the browser can use this service locally and only has to implement HTTP. TCP (which is required as the foundation for HTTP) is usually provided by the operating system.
HTTP Messages
- HTTP needs a reliable connection
- the foundation for HTTP is the Transmission Control Protocol (TCP)
- DNS resolution yields an IP address
- open TCP connection to port 80 or port specified in URI (http://pc-4528.ethz.ch:8080/)
- HTTP is a text-based protocol
- the connection is used to transmit text messages
- all HTTP messages are human-readable
- basic HTTP operations can be carried out by hand
start-line
message-header *
message-body ?
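Because HTTP messages are plain text, splitting one into the three parts shown above takes only a few lines. The following is a minimal sketch rather than a complete parser: it assumes CRLF line endings and simply overwrites repeated header fields. The function name `parse_message` is chosen for this example.

```python
def parse_message(raw):
    """Split an HTTP message into start-line, header fields, and body."""
    # An empty line separates the header section from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Field names are case-insensitive, so store them in lower case.
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


raw = b"GET / HTTP/1.1\r\nHost: ischool.berkeley.edu\r\n\r\n"
start_line, headers, body = parse_message(raw)
print(start_line, headers, body)
```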
HTTP Header Fields
- Header fields contain information about the message
- general header: Date as the message origination date
- request header: Accept-Language indicates language preferences
- response header: Server contains system information
- entity header: Content-Type specifies the media type of the entity
- HTTP defines a number of header fields
- unknown fields must be ignored (extensibility)
- unstandardized fields should use an X- prefix
- HTTP is about acting on these fields
- HTTP defines what HTTP implementations must or should do
HTTP Requests
- After opening a connection, the client sends a request
- the method indicates the action to be performed on the resource
- HTTP's most interesting methods are: GET, HEAD, POST
- other interesting methods are: PUT, DELETE
- The URI identifies the resource to which the request should be applied
- absolute URIs are required when contacting Proxies
- absolute paths are required when contacting a server directly
- the URI may contain Query Information
- fragment identifiers are not sent (they are interpreted on the client side)
- The Host header field must be included in every request
Method Request-URI HTTP/Major.Minor
[Header]*
[Entity]?
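The rules for the Request-URI can be made concrete with a small request builder. This is a sketch under assumptions: `build_request` is a made-up helper, and the path `/search?q=http#results` is a hypothetical example URI, not a real resource.

```python
from urllib.parse import urlsplit


def build_request(method, uri, via_proxy=False):
    """Build a minimal HTTP/1.1 request without an entity."""
    parts = urlsplit(uri)  # the fragment is split off and never sent
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    if via_proxy:
        # Proxies need the absolute URI to know which server to contact.
        target = f"{parts.scheme}://{parts.netloc}{path}"
    else:
        # A server contacted directly only needs the absolute path.
        target = path
    return f"{method} {target} HTTP/1.1\r\nHost: {parts.netloc}\r\n\r\n"


uri = "http://ischool.berkeley.edu/search?q=http#results"  # hypothetical URI
direct = build_request("GET", uri)
proxied = build_request("GET", uri, via_proxy=True)
print(direct)
print(proxied)
```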
HTTP GET
- Retrieval action based on the URI
- may be implemented by reading a file
- may be implemented by processing a file (PHP)
- may be implemented by invoking a process
- Semantics may change based on header fields
- If-*: only reply with the entity if necessary
- Range: only reply with the requested part of the entity
- Cacheability depends on header fields of the response
GET / HTTP/1.1
Host: ischool.berkeley.edu
HTTP Responses
- The server's response to interpreting a request
- the status code is given numerically and as text
- 2** for variations of ok
- 3** for redirections
- 4** are different client-side problems (404: not found)
- 5** are different server-side problems
- Header fields specify additional information
- information about the server
- information about the entity (media type, encoding, language)
HTTP/Major.Minor Status-Code Text
[Header]*
[Entity]?
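A client can decide how to react to a response from the first digit of the status code alone. A minimal sketch, with `parse_status_line` and `status_class` as illustrative names and only the four classes listed above covered:

```python
STATUS_CLASSES = {
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def parse_status_line(line):
    """Split 'HTTP/1.1 404 Not Found' into version, numeric code, and text."""
    version, code, text = line.split(" ", 2)
    return version, int(code), text


def status_class(code):
    """Map a status code to its class based on the first digit."""
    return STATUS_CLASSES.get(code // 100, "other")


version, code, text = parse_status_line("HTTP/1.1 404 Not Found")
print(version, code, text, status_class(code))
```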
HTTP Performance
- HTTP/1.0 allowed one transaction per connection
- TCP connection setup and teardown are expensive
- TCP's slow start slows down the initial phase of data transfer
- typical Web pages use between 10 and 20 resources (HTML + images)
- typically, these resources are stored on the same server
- HTTP/1.1 introduces persistent connections
- the TCP connection stays open for some time (10 seconds is a popular choice)
- additional requests to the same server use the same TCP connection
- HTTP/1.1 introduces pipelined connections
- instead of waiting for a response, requests can be queued
- the server responds as fast as possible
- the order may not be changed (there is no sequence number)
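The effect of the three connection styles can be estimated with a back-of-the-envelope model. This is a rough sketch under strong assumptions: it counts one round trip for the TCP handshake and one per request/response exchange, and ignores slow start, connection teardown, bandwidth, and server processing time. The function name and mode labels are invented for this example.

```python
def round_trips(resources, mode):
    """Very rough count of network round trips needed to fetch `resources` items."""
    if mode == "http/1.0":
        # New TCP connection per resource: handshake plus request/response each time.
        return resources * 2
    if mode == "persistent":
        # One handshake, then one request/response exchange after another.
        return 1 + resources
    if mode == "pipelined":
        # One handshake, then all requests are sent without waiting (ideal case).
        return 1 + 1
    raise ValueError(f"unknown mode: {mode}")


for mode in ("http/1.0", "persistent", "pipelined"):
    print(mode, round_trips(15, mode))
```

For a page with 15 resources the model gives 30, 16, and 2 round trips. Real numbers differ, but the ordering illustrates why HTTP/1.1 changed connection handling.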
HTTP Connection Handling
What is Content Negotiation?
- Negotiation between two HTTP peers
- resources may be available in different representations
- possible dimensions are language, graphics format, character encoding, ...
- using one URI, it should be possible to get the "best" resource
- Negotiation requires knowledge about the resource user
- languages depend on humans reading pages
- graphics formats depend on the browser's functionality
- Content negotiation is a form of a Web-based service
- clients request a URI and have some constraints
- using these constraints, the best representation should be served
- ideally, content negotiation should not be too expensive
Three Different Variants
- Server-Side Content Negotiation
- the server has a set of representations and information from the request
- the server returns the "best" representation based on the request
- Client-Side Content Negotiation
- the server responds with a list of different representations
- the client (browser or user) makes a choice and sends a second request
- Transparent Content Negotiation
- Caches act as in client-side negotiation and thus know the available representations
- Clients contacting the cache can be served by the cache as in server-side negotiation
Server-Side Content Negotiation
- Clients usually tell something about themselves
- Accept, Accept-Charset, Accept-Encoding, Accept-Language
- the server also knows their IP address
- the server may also use additional information (Cookies)
- The server needs to find the "best" representation
- most easily by matching the request with available representations
- could also be implemented more dynamically by generating new representations
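Matching a request against the available representations can be sketched for the language dimension. This is a simplified illustration, not the complete HTTP matching algorithm: it handles quality values (q) but ignores wildcards and prefix matching such as de-CH falling back to de. The names `parse_accept` and `negotiate` are assumptions for this example.

```python
def parse_accept(header):
    """Parse an Accept-* style header into (value, quality) pairs."""
    prefs = []
    for item in header.split(","):
        value, *params = [part.strip() for part in item.split(";")]
        quality = 1.0  # the default when no q parameter is given
        for param in params:
            if param.startswith("q="):
                quality = float(param[2:])
        if value:
            prefs.append((value.lower(), quality))
    return prefs


def negotiate(header, available):
    """Pick the available representation the client prefers most."""
    best, best_quality = None, 0.0
    for value, quality in parse_accept(header):
        if value in available and quality > best_quality:
            best, best_quality = value, quality
    # Fall back to the server's default (first) representation.
    return best or available[0]


choice = negotiate("de-CH, de;q=0.8, en;q=0.5", ["en", "de"])
print(choice)
```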
HTTP and Security
- HTTP sends clear-text messages
- listening to HTTP traffic is trivial
- information transferred via simple HTTP is public
- Making HTTP secure requires additional mechanisms
- S-HTTP was an attempt to define a secure version of HTTP
- HTTPS uses a secure communication layer underneath HTTP
- Encryption is done by a layer on top of TCP
- Secure Sockets Layer (SSL) is the protocol layer invented by Netscape
- Transport Layer Security (TLS) is the standardized Internet version
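The layering described above (TCP first, then the secure layer, then HTTP on top) is visible in how a client opens a connection. A minimal sketch with Python's standard `ssl` module; `open_tls_connection` is a made-up helper that needs network access, so it is defined but not called here.

```python
import socket
import ssl


def open_tls_connection(host, port=443):
    """Open a TCP connection and wrap it in TLS (needs network access)."""
    context = ssl.create_default_context()
    raw = socket.create_connection((host, port), timeout=10)
    try:
        # HTTP messages would then be sent over the returned socket.
        return context.wrap_socket(raw, server_hostname=host)
    except Exception:
        raw.close()
        raise


# The default context verifies the server certificate and its host name.
context = ssl.create_default_context()
print(context.verify_mode, context.check_hostname)
```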
HTTP and SSL
Proxies
- HTTP often is end-to-end
- there is a direct connection between my browser and the server
- HTTP allows using proxies, which are HTTP intermediaries
- Proxies are used for security reasons
- a proxy is an important part of a firewall
- it hides the user's identity by acting on behalf of the user
- proxies are ideally suited for logging and filtering
- Proxies are used for performance reasons
- requests and responses can be cached, speeding up responses significantly
- caching depends on the ability to know when the cache is outdated
- HTTP enables proxies to validate their cached copies
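Cache validation can be illustrated with a small simulation. This sketch assumes validation by entity tag, using the ETag response header and the If-None-Match request header; the `origin` function and `ProxyCache` class are invented stand-ins for a real server and proxy, and no network traffic is involved.

```python
import hashlib


def origin(resource_body, if_none_match=None):
    """Stand-in for an origin server answering a (conditional) GET."""
    etag = '"' + hashlib.sha256(resource_body).hexdigest()[:16] + '"'
    if if_none_match == etag:
        return 304, etag, b""        # not modified: no entity is sent
    return 200, etag, resource_body  # full response with the current entity


class ProxyCache:
    """Caches a single resource and validates it before reuse."""

    def __init__(self):
        self.etag = None
        self.body = None

    def fetch(self, current_body):
        # current_body models what the origin server holds right now.
        status, etag, body = origin(current_body, if_none_match=self.etag)
        if status == 200:
            self.etag, self.body = etag, body
        return status, self.body


cache = ProxyCache()
first = cache.fetch(b"version 1")   # cache is empty: full transfer
second = cache.fetch(b"version 1")  # unchanged: validated, served from cache
third = cache.fetch(b"version 2")   # changed: cache is refreshed
print(first, second, third)
```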
Browsers & Proxies
A proxy is configured in the browser (manually or automatically), so that the browser sends all requests to the proxy instead of the target Web server. The proxy then forwards the request. Proxies can be chained, so that the requests and responses travel through a number of HTTP systems.
Firewalls
- Firewalls are used to protect computers
- protecting users from worms and viruses
- protecting servers from intrusion attacks
- firewalls analyze and block traffic based on complex rules
- A reverse proxy can be part of a firewall concept
- it is configured and maintained by the service provider
- it is a single access point through which HTTP traffic goes
- it is good because it bundles access control to servers behind it
- it is bad because it is a single point of failure
Web Server Service
- HTTP is much more than file transfer
- it is a protocol for the concept of resource manipulation
- it is a distinct step away from the API approach to building distributed systems
- HTTP servers can be configured to deliver good or bad service
- this is a question of how well they are configured on the HTTP level
- it is also a question of how good the Web design is
- both issues together are required to set up a good Web server
- Assignment 1 is an exercise in providing a good service
- very simple configuration of Apache
- this already is "cutting edge"! most servers are not properly configured...