Content Syndication

Web Architecture [./]
Fall 2008 — INFO 290-03 (CCN 42584)

Erik Wilde, UC Berkeley School of Information
2008-10-30

Creative Commons License [http://creativecommons.org/licenses/by/3.0/]

This work is licensed under a CC
Attribution 3.0 Unported License
[http://creativecommons.org/licenses/by/3.0/]

Contents E. Wilde: Content Syndication


(2) Abstract

For many information sources on the Web, it is useful to have a standardized way of subscribing to information updates. Syndication formats such as RSS and Atom can be used by these information sources to publish a feed of updated information items. While RSS and Atom are read-only formats, the Atom Publishing Protocol (AtomPub) builds on top of Atom and provides a protocol for submitting new items to feeds.




(3) Content Feeds



Syndication Formats

Outline (Syndication Formats)

  1. Syndication Formats [18]
    1. RSS [11]
    2. Atom [7]
  2. Syndication Aggregation [7]
    1. FeedBurner [5]
  3. Atom Publishing Protocol [8]
  4. Conclusions [1]

RSS


(6) RSS History

  • The Myth of RSS Compatibility [http://diveintomark.org/archives/2004/02/04/incompatible-rss] provides a good overview
  • RSS is a textbook example of why standards are a good thing
    • RSS 0.9 [RSS 0.9 (1)] was created for the My Netscape portal in March 1999
    • RSS 0.91 (a simplification) was introduced in July 1999 (as an interim solution)
    • the AOL/Netscape merger removed the format from the company's portal
    • RSS was without an owner, and different parties claimed/denied ownership
    • RSS 1.0 [RSS 1.0 (1)] was created by an informal developer group
    • RSS 0.92 (and 0.93 and 0.94) were published without acknowledging RSS 1.0
    • finally, RSS 2.0 [RSS 2.0 (1)] was released as a follow-up to the RSS 0.9x versions
  • Using RSS has become an exercise in managing a menagerie of versions



(7) RSS 0.9

  • RSS means RDF Site Summary (or Rich Site Summary?)
    • based on an RDF draft and not compatible with the final RDF specification
    • RDF was considered too cumbersome and unstable
    • 0.90 (proto-RDF) was quickly replaced by the non-RDF 0.91 version
  • RSS 0.92+ versions were developed as unilateral specifications
    • starting with RSS 0.91, RSS means Rich Site Summary
    • it is no longer built on RDF, instead it simply uses XML
    • the 0.9x branch eventually was renamed to RSS 2.0 [RSS 2.0 (1)]



(9) RSS 1.0

  • RSS means RDF Site Summary (this time for real)
    • based on the final RDF specification and thus incompatible with any RSS 0.9 [RSS 0.9 (1)]
    • developed when the Semantic Web [Semantic Web] and RDF [Semantic Web; Resource Description Framework (RDF) (1)] were first heavily marketed (1999 [http://dret.net/biblio/reference/lee99])
    • RDF was expected to become the format for metadata on the Web
  • RSS 1.0 makes heavy use of XML Namespaces
  • RSS 1.0 introduces features which were not present in 0.91
    • date information for published items (very relevant for news feeds)
    • individual authors for various items in a feed
  • RSS 1.0 is the latest version of RDF-based RSS
    • the Semantic Web [Semantic Web] wave is not over yet, but RDF [Semantic Web; Resource Description Framework (RDF) (1)] has lost its novelty appeal
    • for a more XML-oriented encoding, the non-RDF RSS 0.9x branch [RSS 0.9 (1)] provides a better foundation



(11) RSS 2.0

  • RSS now means Really Simple Syndication
    • RSS 2.0 is the continuation of the 0.91 branch (which dropped RDF)
    • together with RSS 1.0 [RSS 1.0 (1)], it is one of the two most popular versions of RSS
    • migration from 0.91 to 2.0 is easily possible
  • RSS 2.0 tries to avoid the use of XML Namespaces
  • RSS 2.0 is increasingly used with extensions [http://rss-extensions.org/wiki/Main_Page] for vendor-specific information
    • the RSS core is minimal, so many applications need extensions
    • many extensions have overlapping functionality
    • most extensions have unclear semantics and unclear versioning policies



(13) The Case for Content Management

  • RSS is very rarely produced by hand
    • by definition, RSS contains redundant information for a specific purpose
  • If a Content Management System (CMS) is used, RSS can be generated
    • basic metadata can be generated by the CMS (title, author, date)
    • better tagging of content results in better tagging of feeds
    • well-tagged feeds are better foundations for large-scale reuse of feed items
  • Blogging is simply a specialized case of a CMS
    • Web-based interface for controlling everything
    • strictly time-ordered sequence of published items
    • navigation features primarily based on the time-specific facets of the blog (maybe tags)
    • all blogging tools include feed support



(14) Consuming RSS

  • RSS feeds often have quality problems
    • surprisingly often feeds do not even deliver well-formed XML
    • the use of embedded markup in RSS is not well-defined
  • Writing an RSS reader from scratch is not a good idea
  • There are three major tasks which RSS readers must do
    1. accept RSS feeds that are not well-formed XML and repair them
    2. look at the feed contents and bring them into a unified form
    3. produce a unified view of feeds regardless of the RSS version



(15) RSS Technical Problems

  • What to put into an item's description
    • the fundamental question is whether a description is text or HTML
    • if there is no well-defined way, then interpretation is client-specific
      <description>This is a <em>very important</em> blog post …
      <description>This is a &lt;em>very important&lt;/em> blog post …
      <description>This is a blog post about <em> in RSS feeds …
      <description>This is a blog post about &lt;em> in RSS feeds …
      <description>This is a blog post about &amp;lt;em> in RSS feeds …
  • Underspecified and not very robust in various other areas
    • broken RSS is accepted by most readers (but fixing it can change the interpretation)
    • the interpretation of relative URIs is not mentioned in the specifications
    • some minimal semantics (classification) for items would be very useful
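
The text-versus-HTML ambiguity can be made concrete with a small Python sketch. It assumes the XML parser has already delivered the element text (so `&lt;em>` arrives as `<em>`), and it models the two client behaviors with two hypothetical helper functions; real readers use far more elaborate heuristics.

```python
import html
import re

def as_text(raw):
    """Client reading 1: the description is plain text, show it verbatim."""
    return raw

def as_html(raw):
    """Client reading 2: the description is HTML, so drop tags and resolve entities."""
    without_tags = re.sub(r"<[^>]+>", "", raw)
    return html.unescape(without_tags)

# Parsed text of: <description>This is a &lt;em>very important&lt;/em> blog post</description>
emphasis = "This is a <em>very important</em> blog post"

# Parsed text of: <description>This is a blog post about &lt;em> in RSS feeds</description>
about_em = "This is a blog post about <em> in RSS feeds"

emphasis_text, emphasis_html = as_text(emphasis), as_html(emphasis)
about_text, about_html = as_text(about_em), as_html(about_em)
```

For the first description the HTML reading is probably what the author intended; for the second it silently drops the `<em>` the post is about. Without a declared content type, neither client can know which reading is right.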



(16) RSS Political Problems

  • Multiple and incompatible RSS versions [RSS History (1)] are still in widespread use
    • RSS 1.0 [RSS 1.0 (1)] and RSS 2.0 [RSS 2.0 (1)] are incompatible by design (RDF vs. non-RDF)
    • none of the RSS versions is maintained by a universally accepted standards body
  • None of the specifications is being updated or fixed
    • lessons learned from RSS deployment are not being fed into a new version
    • it is unlikely that a new version will be produced which merges the RSS landscape
  • Invent something new instead of trying to fix RSS
    • Atom [Atom (1)] started in 2003 (called Echo at first)
    • W3C or IETF would have been promising candidates for a new RSS
    • W3C is more formal, IETF is more developer-centered
    • IETF was chosen over W3C [http://www.bestkungfu.com/?p=492] because of the Atom community's preferences


Atom


(18) Atom History

  • RSS's shortcomings were very apparent and could not be fixed
  • In mid-2003, discussions started about an improved format
  • It also became apparent that the format should have a protocol
  • Atom 0.3 was released in December 2003 but had no formal home
  • IETF was chosen as the new home with a working group in June 2004
  • RFC 4287 [http://dret.net/rfc-index/reference/RFC4287] was published in December 2005
  • AtomPub [Atom Publishing Protocol (1)] was published as RFC 5023 [http://dret.net/rfc-index/reference/RFC5023] in October 2007



(19) Atom vs. RSS

  • Standardized by the IETF (well-defined process)
  • Classification of entries (user-defined categories)
  • More XML-like markup design (more nesting)
  • Namespaces are used and supported as standard mechanism
  • Atom feeds must be well-formed XML (there even is a schema [http://atompub.org/2005/08/17/atom.rnc])
  • Interpretation of content is well-defined (various content types)
  • Support for xml:lang and xml:base



(20) Atom Example

<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-us">
 <title>ongoing</title>
 <id>http://www.tbray.org/ongoing/</id>
 <link rel='self' href="http://www.tbray.org/ongoing/ongoing.atom"/>
 <updated>2007-04-11T12:55:09-07:00</updated>
 <author>
  <name>Tim Bray</name>
 </author>
 <subtitle>ongoing fragmented essay by Tim Bray</subtitle>
 <entry xml:base="When/200x/2007/04/02/">
  <title>Atom Publishing Protocol Interop!</title>
  <id>http://www.tbray.org/ongoing/When/200x/2007/04/02/APP-Interop</id>
  <published>2007-04-02T13:00:00-07:00</published>
  <updated>2007-04-10T14:24:00-07:00</updated>
  <category scheme="http://www.tbray.org/ongoing/What/" term="Technology/Atom"/>
  <category scheme="http://www.tbray.org/ongoing/What/" term="Technology"/>
  <category scheme="http://www.tbray.org/ongoing/What/" term="Atom"/>
  <content type="xhtml">
   <div xmlns="http://www.w3.org/1999/xhtml">
    <p>Mark your calendar: <a href="http://www.intertwingly.net/wiki/pie/April2007Interop">April 16-17 at Google</a>. <em>Everybody</em> is invited, provided they bring along an APP implementation, client or server. This was just announced a couple of days ago, and as I write this there are already <s>six</s> twelve client and <s>seven</s> fourteen server implementations signed up to be there and try to <a href="http://www.intertwingly.net/wiki/pie/InteropGrid">fill in the grid</a>. Let’s drop some names, in alphabetical order: AOL, Flock, Google, IBM, Lotus, Microsoft, Oracle, O’Reilly, Six Apart, Sun, WordPress. Um, have I mentioned that the APP is going to be huge?</p>
   </div>
  </content>
 </entry>
</feed>
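
The entry in this example carries a relative xml:base. The following Python sketch shows how a client might resolve relative references inside such an entry. It assumes the feed was retrieved from the URI given in its self link, and the relative `href="APP-Interop"` in the shortened sample is a hypothetical addition for illustration (the example above contains no such link).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

ATOM = "{http://www.w3.org/2005/Atom}"
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# Shortened version of the feed above; the relative link is hypothetical
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
 <entry xml:base="When/200x/2007/04/02/">
  <link href="APP-Interop"/>
 </entry>
</feed>"""

# Assumption: the document URI equals the feed's self link
feed_uri = "http://www.tbray.org/ongoing/ongoing.atom"

feed = ET.fromstring(SAMPLE)
entry = feed.find(ATOM + "entry")

# xml:base on the entry is itself relative, so resolve it against the feed URI
entry_base = urljoin(feed_uri, entry.get(XML_BASE, ""))

# References inside the entry are then resolved against the entry's base
resolved = urljoin(entry_base, entry.find(ATOM + "link").get("href"))
```

Under these assumptions the relative link resolves to the same URI that the example uses as the entry's id.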



(21) Atom Content

  • RSS had no safe way of finding out what an entry's content is
    • this led implementations to use heuristics to guess what the RSS author really wanted
    • one of Atom's main goals was to improve this in a well-defined way
    • Atom allows escaped markup (the only way to include non-XML HTML in an XML format)
  • Each content element should have a type (the default is text)
  • Atom's content interpretation algorithm (use first applicable rule):
    1. if type is text, no child elements are allowed (plain text content)
    2. if type is html then RSS's method of escaped markup is used
    3. if type is xhtml then there must be a div containing XHTML markup
    4. if type is an XML media type [Media Types] then the content should be treated as this type
    5. if type starts with text/ then no child elements are allowed
    6. for all other values, the content must be a Base64-encoded entity of the specified MIME type
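
The rule order above can be written down as a small Python function. This is a sketch, not a validator: the function name and the returned labels are made up here, media type parameters are simply ignored, and the list of XML media types follows the definition RFC 4287 references (the RFC 3023 types, plus anything ending in `+xml` or `/xml`).

```python
XML_MEDIA_TYPES = {
    "text/xml",
    "application/xml",
    "text/xml-external-parsed-entity",
    "application/xml-external-parsed-entity",
    "application/xml-dtd",
}

def content_kind(type_attr=None):
    """Classify an atom:content type attribute using the first applicable rule."""
    # A missing type attribute defaults to "text"; parameters are dropped (simplification)
    t = (type_attr or "text").split(";")[0].strip().lower()
    if t == "text":
        return "plain text"          # rule 1: no child elements
    if t == "html":
        return "escaped html"        # rule 2: escaped markup
    if t == "xhtml":
        return "xhtml div"           # rule 3: a single XHTML div child
    if t in XML_MEDIA_TYPES or t.endswith("+xml") or t.endswith("/xml"):
        return "inline xml"          # rule 4: XML media type
    if t.startswith("text/"):
        return "text media type"     # rule 5: no child elements
    return "base64"                  # rule 6: everything else
```

Note that the order matters: `text/xml` starts with `text/` but is caught by rule 4 first, so `content_kind("text/xml")` is classified as inline XML while `content_kind("text/plain")` falls through to rule 5.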



(23) Atom Categories

  • Atom allows categories to be assigned to entries
    • each category element must have a term attribute for the category
    • an optional scheme identifies the categorization scheme (ontology, taxonomy, …)
    • an optional label attribute provides a human-readable label for the category
  • AtomPub [Atom Publishing Protocol (1)] defines a document format for Category Documents [Category Documents (1)]
  • Three different cases of categorization can be distinguished
    1. use a well-known scheme (such as Dublin Core)
    2. use a private but well-designed scheme (which has a URI and can be reused reliably)
    3. use tags without schemes, which then are little more than content labels
  • Widely-known tags are not easy to handle [http://www.tbray.org/ongoing/When/200x/2007/02/01/Tag-Scheme]
    • they are more than just privately assigned tags
    • there is no formal scheme for them, just an emerging consensus



(24) Switching from RSS to Atom

  • Generate both feeds but serve RSS with an HTTP redirect (301)
    • old subscribers with broken clients can still use the RSS feed
    • old subscribers with correct clients will use the Atom feed
  • Atom exposes more information than RSS (category for tags)
    • the mapping of publishing info to the feed has to be changed/extended
    • for standard metadata use Atom's built-in metadata elements
    • for application-specific metadata consider reusing an existing metadata schema
  • Atom can be used to publish snippets as well as full content
    • content allows any type of content to be used and may contain a complete entry
    • summary allows only text constructs (text, html, or xhtml) and should provide a condensed version of an entry
    • some Atom sources publish two feeds for summaries and content
  • Generate good Atom and downgrade it to RSS 1.0 & 2.0
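
One possible reading of the redirect strategy in the first bullet is sketched below in Python: the old RSS URI answers with status 301 and a Location header pointing to the Atom feed, but still carries the RSS document as the response body, so clients that ignore redirects keep working. The paths, the `respond` function, and the placeholder feed bodies are assumptions made for this sketch; it models only the response, not a running server.

```python
# Hypothetical feed locations and placeholder bodies
RSS_PATH = "/feed.rss"
ATOM_PATH = "/feed.atom"
RSS_BODY = '<rss version="2.0"><channel><title>Example</title></channel></rss>'
ATOM_BODY = '<feed xmlns="http://www.w3.org/2005/Atom"><title>Example</title></feed>'

def respond(path):
    """Return (status, headers, body) for a request to the given path."""
    if path == RSS_PATH:
        # Permanent redirect to Atom; the RSS body serves clients that do not follow it
        headers = {"Location": ATOM_PATH, "Content-Type": "application/rss+xml"}
        return 301, headers, RSS_BODY
    if path == ATOM_PATH:
        return 200, {"Content-Type": "application/atom+xml"}, ATOM_BODY
    return 404, {}, ""

status, headers, body = respond(RSS_PATH)
```

A client that handles 301 correctly updates its subscription to the Atom URI, while a client that ignores the status code still finds RSS in the body.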


Syndication Aggregation


(26) End-User Aggregation

<link rel="alternate" type="application/rdf+xml" title="…" href="…" />
<link rel="alternate" type="application/rss+xml" title="…" href="…" />
<link rel="alternate" type="application/atom+xml" title="…" href="…" />
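
A client can discover feeds by scanning a page for link elements like the ones above. The following Python sketch uses the standard library's `html.parser`; the class name, the sample page, and its `example.org` URLs are made up for illustration, and relative href values are not resolved here.

```python
from html.parser import HTMLParser

# Media types of the three link elements shown above
FEED_TYPES = {"application/rdf+xml", "application/rss+xml", "application/atom+xml"}

class FeedLinkFinder(HTMLParser):
    """Collect (type, href) pairs of link elements that advertise a feed."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rels = (a.get("rel") or "").lower().split()
        media_type = (a.get("type") or "").lower()
        if "alternate" in rels and media_type in FEED_TYPES and a.get("href"):
            self.feeds.append((media_type, a["href"]))

# Hypothetical page head with one stylesheet link and two feed links
PAGE = """<html><head>
<link rel="stylesheet" type="text/css" href="style.css" />
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://example.org/feed.rss" />
<link rel="alternate" type="application/atom+xml" title="Atom" href="http://example.org/feed.atom" />
</head><body></body></html>"""

finder = FeedLinkFinder()
finder.feed(PAGE)
feeds = finder.feeds
```

The stylesheet link is skipped because its rel and type do not match, and the two feed links are returned in document order.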



(27) Aggregation Intermediaries



FeedBurner


(29) Fixing Feeds

Cleaning Up Feeds


(30) Load Balancing

Providing Feed Load Balancing


(31) Statistics/Analytics

Providing Feed Statistics


(32) Query Capabilities

  • Feed technology is still evolving
  • Feeds are mostly viewed as ordered by time
    • allows optimization for accesses and caching
    • makes it hard to use feeds for non-timed information
  • Feeds could be ordered by any sort key
    • makes server-side feed processing much more expensive
    • enables customized feeds that are processed on the server-side



(33) Supporting Queryable Feeds

Supporting Queryable Feeds

Atom Publishing Protocol


(35) Syndication Format Protocols




(36) RESTified Syndication




(37) Collections, Members, Entries, Media




(38) Protocol Summary

Resource       HTTP Method  Representation                                 Description
Introspection  GET          Atom Service Document [Service Documents (1)]  Enumerates a set of collections and lists their URIs and other information about the collections
Collection     GET          Atom Feed                                      A list of members of the collection (this may be a subset of all entries in the collection)
Collection     POST         Atom Entry                                     Create a new entry in the collection
Member         GET          Atom Entry                                     Get the Atom Entry
Member         PUT          Atom Entry                                     Update the Atom Entry
Member         DELETE       n/a                                            Delete the Atom Entry from the collection
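
The entry operations in this summary map directly onto HTTP requests. The Python sketch below only constructs the request objects with `urllib.request` and does not send them; the collection and member URIs, the helper names, and the sample entry are hypothetical, and authentication and error handling are left out.

```python
import urllib.request

ENTRY_TYPE = "application/atom+xml;type=entry"

# Hypothetical URIs; a real client would take them from the service document
# and from the Location header returned when an entry is created
COLLECTION = "http://example.org/blog/entries"
MEMBER = "http://example.org/blog/entries/1"

ENTRY = b"""<entry xmlns="http://www.w3.org/2005/Atom">
 <title>Hello</title><id>urn:example:1</id>
 <updated>2008-10-30T10:00:00Z</updated>
 <author><name>Example Author</name></author>
 <content type="text">First post</content>
</entry>"""

def list_members(collection_uri):
    """Collection GET: retrieve the Atom feed listing the members."""
    return urllib.request.Request(collection_uri, method="GET")

def create_entry(collection_uri, entry_xml):
    """Collection POST: create a new entry in the collection."""
    return urllib.request.Request(collection_uri, data=entry_xml, method="POST",
                                  headers={"Content-Type": ENTRY_TYPE})

def update_entry(member_uri, entry_xml):
    """Member PUT: replace the entry with a new representation."""
    return urllib.request.Request(member_uri, data=entry_xml, method="PUT",
                                  headers={"Content-Type": ENTRY_TYPE})

def delete_entry(member_uri):
    """Member DELETE: remove the entry from the collection."""
    return urllib.request.Request(member_uri, method="DELETE")

requests = [list_members(COLLECTION), create_entry(COLLECTION, ENTRY),
            update_entry(MEMBER, ENTRY), delete_entry(MEMBER)]
methods = [r.get_method() for r in requests]
```

Passing any of these objects to `urllib.request.urlopen` would perform the operation against a real AtomPub server.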



(39) Service Documents

Service Documents represent server-defined groups of Collections, and are used to initialize the process of creating and editing resources.



(40) Service Document Example

<service xmlns="http://www.w3.org/2007/app" xmlns:atom="http://www.w3.org/2005/Atom">
 <workspace>
  <atom:title>Main Site</atom:title>
  <collection href="http://example.org/reilly/main">
   <atom:title>My Blog Entries</atom:title>
   <categories href="http://example.com/cats/forMain.cats"/>
  </collection>
  <collection href="http://example.org/reilly/pic">
   <atom:title>Pictures</atom:title>
   <accept>image/*</accept>
  </collection>
 </workspace>
 <workspace>
  <atom:title>Side Bar Blog</atom:title>
  <collection href="http://example.org/reilly/list">
   <atom:title>Remaindered Links</atom:title>
   <accept>application/atom+xml;type=entry</accept>
   <categories fixed="yes">
    <atom:category scheme="http://example.org/extra-cats/" term="joke"/>
    <atom:category scheme="http://example.org/extra-cats/" term="serious"/>
   </categories>
  </collection>
 </workspace>
</service>
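
A client typically starts by reading such a service document to find out which collections exist and what they accept. The Python sketch below is one way to do that. The shortened sample uses the namespace that RFC 5023 assigns to AtomPub (`http://www.w3.org/2007/app`), while the function itself takes the namespace from the root element, so it is not tied to one namespace URI. The function name and the tuple layout are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

SAMPLE_SERVICE = """<service xmlns="http://www.w3.org/2007/app"
         xmlns:atom="http://www.w3.org/2005/Atom">
 <workspace>
  <atom:title>Main Site</atom:title>
  <collection href="http://example.org/reilly/main">
   <atom:title>My Blog Entries</atom:title>
  </collection>
  <collection href="http://example.org/reilly/pic">
   <atom:title>Pictures</atom:title>
   <accept>image/*</accept>
  </collection>
 </workspace>
</service>"""

def list_collections(service_xml):
    """Return (workspace title, collection title, href, accepted types) tuples."""
    root = ET.fromstring(service_xml)
    # Reuse whatever namespace the root element is in, e.g. "{http://www.w3.org/2007/app}"
    app = root.tag[: root.tag.index("}") + 1]
    result = []
    for workspace in root.findall(app + "workspace"):
        ws_title = workspace.findtext(ATOM + "title")
        for col in workspace.findall(app + "collection"):
            accepts = [a.text for a in col.findall(app + "accept")]
            result.append((ws_title, col.findtext(ATOM + "title"),
                           col.get("href"), accepts))
    return result

collections = list_collections(SAMPLE_SERVICE)
```

Each tuple gives the client the URI to POST to and the media types it may submit there; an empty list means no accept element was present.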



(41) Category Documents




(42) Category Document Example

<app:categories xmlns:app="http://www.w3.org/2007/app" xmlns="http://www.w3.org/2005/Atom" fixed="yes" scheme="http://example.com/cats/big3">
 <category term="animal"/>
 <category term="vegetable"/>
 <category term="mineral"/>
</app:categories>
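
In RFC 5023, fixed="yes" signals that the listed categories form a closed set, and a scheme on the categories element is inherited by child categories that do not declare their own. The Python sketch below checks a candidate category against such a document. The function name is made up, and how a server reacts to a category outside a fixed set is left open here.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

SAMPLE_CATEGORIES = """<app:categories xmlns:app="http://www.w3.org/2007/app"
    xmlns="http://www.w3.org/2005/Atom"
    fixed="yes" scheme="http://example.com/cats/big3">
 <category term="animal"/>
 <category term="vegetable"/>
 <category term="mineral"/>
</app:categories>"""

def is_listed(categories_xml, term, scheme=None):
    """Return True if the category is acceptable according to the document."""
    root = ET.fromstring(categories_xml)
    if root.get("fixed", "no") != "yes":
        return True  # open set: any category may be used
    default_scheme = root.get("scheme")
    for cat in root.findall(ATOM + "category"):
        # A category without its own scheme inherits the one on the root element
        if cat.get("term") == term and cat.get("scheme", default_scheme) == scheme:
            return True
    return False

ok = is_listed(SAMPLE_CATEGORIES, "animal", "http://example.com/cats/big3")
unknown_term = is_listed(SAMPLE_CATEGORIES, "plasma", "http://example.com/cats/big3")
wrong_scheme = is_listed(SAMPLE_CATEGORIES, "animal")
```

A publishing client could run such a check before submitting an entry, so that categories outside a fixed set are caught locally.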


Conclusions


(44) Semantic Web Light


