Web Search

Web Architecture and Information Management [./]
Spring 2010 — INFO 190-02 (CCN 42509)

Erik Wilde and Ryan Shaw, UC Berkeley School of Information

Creative Commons License [http://creativecommons.org/licenses/by/3.0/]

This work is licensed under a CC Attribution 3.0 Unported License

(2) Abstract

In his early vision of the Web, Tim Berners-Lee expected that most people would discover information by following hyperlinks, rather than by using keyword searches. Thus there is no search functionality built into the Web. Web search engines came later and had a profound effect on how we use and experience the Web. Now it is hard to imagine using the Web without search, a fact that has both technological and political implications.

The Technology of Web Search

Outline (The Technology of Web Search)

  1. The Technology of Web Search
    1. Crawling
    2. Indexing
    3. Ranking
  2. The Politics of Web Search

(4) The Technology of Web Search

The specific details of how web search engines work are closely guarded trade secrets. However, the basic ingredients of web search are shared by all search engines. These are:

  1. Crawling, or finding resources
  2. Indexing, or analyzing representations
  3. Ranking, or ordering results for specific queries


Crawling

(6) Web Crawlers

  • A web crawler or web spider is a program that finds web resources by following links
  • Algorithm for a simple web crawler:
    1. Pick a URI to start with
    2. HTTP GET (download) that URI
    3. (Assuming it is HTML) extract URIs from all the links
    4. For each URI, go to step #2
  • In actuality, web crawling gets much more complicated than this
  • Designers of web crawlers have two goals:
    • Freshness: finding new content as soon as possible
    • Comprehensiveness: finding as much (good) content as possible
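
The four-step algorithm above can be sketched in Python roughly as follows. This is a minimal illustration, not how any production crawler works. The names `LinkExtractor`, `fetch`, `crawl`, the `fetcher` parameter, and the `max_pages` limit are all assumptions introduced here, and the sketch ignores politeness, robots.txt, and non-HTML content.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch(uri):
    """HTTP GET a URI and return the body as text (decoded leniently as UTF-8)."""
    with urlopen(uri, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_uri, fetcher=fetch, max_pages=50):
    """Breadth-first crawl from start_uri; returns URIs in the order visited."""
    queue = deque([start_uri])          # step 1: pick a URI to start with
    seen = {start_uri}                  # never queue the same URI twice
    visited = []
    while queue and len(visited) < max_pages:
        uri = queue.popleft()
        try:
            html = fetcher(uri)         # step 2: HTTP GET that URI
        except (OSError, ValueError):
            continue                    # unreachable or malformed URI: skip it
        visited.append(uri)
        parser = LinkExtractor()
        parser.feed(html)               # step 3: extract URIs from all the links
        for href in parser.hrefs:
            absolute, _fragment = urldefrag(urljoin(uri, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)  # step 4: each new URI goes back to step 2
    return visited
```

Passing a different `fetcher` (for example one that reads from a dictionary of canned pages) lets the loop be tried without touching the network. The `seen` set and the `max_pages` cap hint at why real crawling is harder than the four steps suggest: without them the loop would revisit pages forever.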

(7) Web Crawling Visualization

http://drunkmenworkhere.org/bintree/yahoo_small.png [http://drunkmenworkhere.org/219.php?a=yahoo2hour]

via Drunk Men Work Here [http://drunkmenworkhere.org/]

(8) robots.txt

  • Sometimes website operators do not want certain pages to be crawled
  • robots.txt is the answer to this problem
  • robots.txt is a text file placed at the root of a given domain
  • The file lists URI paths that crawlers should not visit
  • Compliance is voluntary: there is nothing to stop crawlers from visiting disallowed paths
    • However, website operators may attempt to block crawlers that do not respect robots.txt
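
As a rough illustration of how a well-behaved crawler might consult robots.txt, the sketch below uses the `urllib.robotparser` module from the Python standard library. The file contents, the paths, the `ExampleBot` user agent, and the helper name `may_crawl` are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, as it might be served from http://example.org/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""


def may_crawl(robots_txt, user_agent, uri):
    """Return True if the given robots.txt text permits user_agent to fetch uri."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, uri)


print(may_crawl(ROBOTS_TXT, "ExampleBot", "http://example.org/index.html"))
print(may_crawl(ROBOTS_TXT, "ExampleBot", "http://example.org/private/notes.html"))
```

Nothing here enforces the rules. The check only has an effect because the crawler chooses to call it before fetching, which is exactly the voluntary compliance described above.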

(9) Site Maps

  • Site maps are the opposite of robots.txt: they tell crawlers what is important
  • Website operators used to fill out forms to inform search engines about their sites
  • Site maps ease the process of informing crawlers about a site
    • Just submit the URI of your sitemap, or
    • refer to it from robots.txt
  • Site maps can specify for specific URIs:
    • the time they were last updated
    • how often they are expected to change
    • their importance relative to other URIs on the site
  • Supported by all major search engines
    • But listing a URI in a site map does not guarantee that it will be crawled
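
To make the three per-URI properties concrete, here is a small sketch that writes a site map in the XML format defined by the Sitemaps protocol (sitemaps.org), where they appear as `lastmod`, `changefreq`, and `priority`. The function name `build_sitemap`, the example URIs, and the values are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Serialize a list of dicts (loc plus optional fields) as sitemap XML."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        # Optional hints: last update time, expected change rate, relative importance
        for field in ("lastmod", "changefreq", "priority"):
            if field in entry:
                ET.SubElement(url, field).text = str(entry[field])
    return ET.tostring(urlset, encoding="unicode")


print(build_sitemap([
    {"loc": "http://example.org/", "lastmod": "2010-03-29",
     "changefreq": "weekly", "priority": 1.0},
    {"loc": "http://example.org/archive/", "changefreq": "yearly", "priority": 0.2},
]))
```

To refer to the result from robots.txt, a site would add a line of the form `Sitemap: http://example.org/sitemap.xml` (the URI is again hypothetical).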


Indexing

(11) Indexing

  • Indexing is the process by which a search engine analyzes a web page and stores it for retrieval
  • Fast, effective indexing is critical to the quality of a search engine
  • Different kinds of content must be indexed in addition to HTML pages
    • PDF, Excel, PowerPoint, and Word documents
  • For HTML pages, indexing algorithms must try to determine which text is meaningful and which is not
  • This means separating content from navigation
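
A toy version of these two ideas, stripping navigation and storing terms for retrieval, might look like the sketch below. It assumes that navigation sits inside `nav`, `header`, or `footer` elements, which is a convenient simplification (real pages rarely mark navigation this cleanly), and it stores terms in a simple inverted index. All names (`SKIPPED_TAGS`, `ContentTextExtractor`, `extract_terms`, `build_index`) are introduced here for illustration.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: text inside these elements is navigation or code, not content
SKIPPED_TAGS = {"nav", "header", "footer", "script", "style"}


class ContentTextExtractor(HTMLParser):
    """Collects text that is not nested inside any of the skipped elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def extract_terms(html):
    """Return the lowercased word tokens found in the content part of a page."""
    parser = ContentTextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower())


def build_index(pages):
    """Map each term to the set of URIs whose content contains it."""
    index = defaultdict(set)
    for uri, html in pages.items():
        for term in extract_terms(html):
            index[term].add(uri)
    return dict(index)
```

Looking up a query term is then a dictionary access, for example `build_index(pages).get("boa", set())`, which is what makes retrieval fast once the indexing work has been done.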

(14) Indexing Multimedia

  • Indexing the content of audio, video and images is still a research problem
  • Most techniques are not fast or reliable enough for web search
  • Web search engines still rely on indexing the text around media objects (e.g. captions, comments)
  • Average pixel color can be used to detect pornography
  • Audio fingerprinting can be used to identify copyrighted content
  • Copies or reuses of images can be identified, e.g. this famous kiss [http://www.tineye.com/search/9ab0f71ef8c36c1b2e55d3b33c3241213e65a082?page=24]
  • Faces can be detected, e.g. boa [http://images.google.com/images?q=boa] vs. BoA [http://images.google.com/images?q=boa&imgtype=face]
    • Face recognition is getting better too
  • Some ability to distinguish photos [http://images.google.com/images?q=boa&imgtype=photo] from clip art [http://images.google.com/images?q=boa&imgtype=clipart] from line drawings [http://images.google.com/images?q=boa&imgtype=lineart]


(16) Ranking Results

  • Ranking is the "secret sauce" of any search engine
  • Good ranking is important
    • the vast majority of users do not look beyond the first page of results
  • Ranking can only be as good as indexing allows
    • you can't consider X as a factor if you haven't indexed it
  • Ranking takes into consideration not only what has been indexed (the content), but also the searcher and other context (time, location)

(17) Link Analysis

  • Google's major breakthrough with web search was PageRank, an algorithm for link analysis
  • The basic idea:
    • Links are indications of importance or popularity
    • The more links to a page, the more important/popular it is
    • Links from important/popular pages should count more
    • Links from pages on related topics should count more
  • An old idea, applied to citation graphs before the web
  • All the major search engines use some form of link analysis now
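
The first three ideas above can be sketched as a simplified PageRank computed by repeated redistribution of scores. This is only an approximation of the published algorithm, not Google's production ranking. The damping factor of 0.85, the iteration count, and the handling of pages without outgoing links are assumptions made here, and topical relatedness is not modeled at all.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a graph given as {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}   # start with equal importance
    for _ in range(iterations):
        # Every page keeps a small baseline score regardless of its links
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its own score among the pages it links to,
                # so links from important pages are worth more
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links shares evenly with all
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this tiny graph, page c is linked from both a and b and ends up ranked highest, page a benefits from a link from the important page c, and page b, which nobody links to, keeps only the baseline score.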

(19) nofollow

  • Problem: PageRank can be manipulated
  • Spammers attempt to place links to their sites all over the web
  • Blog comment threads are particularly attractive to spammers
  • Solution: indicate to web crawlers that certain links may not be trustworthy
  • A rel="nofollow" attribute on an anchor (link) tag does this
  • Crawlers that honor nofollow do not follow these links, and link analysis algorithms do not count them when ranking
  • Most blogging software now automatically adds the nofollow attribute to links posted in comment threads
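
A crawler that honors nofollow has to check the `rel` attribute while extracting links. The sketch below shows one way to do that with the standard library HTML parser. The class and function names and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class NofollowAwareLinkExtractor(HTMLParser):
    """Sorts the links of a page into followed and nofollow lists."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # rel may hold several space-separated tokens, e.g. rel="external nofollow"
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followed.append(href)


def split_links(html):
    """Return (followed, nofollow) lists of href values found in html."""
    parser = NofollowAwareLinkExtractor()
    parser.feed(html)
    return parser.followed, parser.nofollow


comment_thread = (
    '<p>Nice post, see <a href="http://example.org/related">this</a>.</p>'
    '<p>Cheap pills at <a rel="external nofollow" href="http://spam.example/">here</a></p>'
)
print(split_links(comment_thread))
```

Only the first list would be queued for crawling or counted as votes in link analysis.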

(20) Other Ranking Factors

  • Location of query term
    • in the URI
    • in the page title
    • in the anchor text of links pointing to the site
    • in the body text
  • Age of site (older is usually better)
  • Site's link structure (hence site maps)
  • Many more: Google claims to consider more than 200 factors when ranking
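
To illustrate how the location of a query term could feed into a score, here is a deliberately naive sketch. The weights in `FIELD_WEIGHTS` are invented for this example and do not reflect how any real search engine weighs these locations. The field names and the function `location_score` are likewise hypothetical.

```python
# Hypothetical weights: the only point is that different locations count differently
FIELD_WEIGHTS = {
    "uri": 3.0,
    "title": 2.5,
    "anchor_text": 2.0,
    "body": 1.0,
}


def location_score(term, page):
    """Sum the weights of every field of page (a dict of strings) containing term."""
    term = term.lower()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        if term in page.get(field, "").lower():
            score += weight
    return score


page = {
    "uri": "http://example.org/snakes/boa",
    "title": "Boa constrictors",
    "anchor_text": "large snakes",
    "body": "The boa is a large snake found in the Americas.",
}
print(location_score("boa", page))
print(location_score("americas", page))
```

A real ranking function would combine many such signals with link analysis and other factors rather than relying on simple substring matches.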

(21) Web Spam

  • Search is not just about putting relevant web sites at the top of the results
  • It's also about making sure bad sites don't show up in results
  • This requires major effort at every step:
    • Avoiding crawling such sites (hence nofollow)
    • Indexing properly, e.g. ignoring hidden text or false descriptions
    • Ranking properly, e.g. identifying and ignoring sites trying to game the system
  • This may be the hardest part of operating a search engine

The Politics of Web Search

(23) Search Engines as Gatekeepers

(24) Ranking Intervention

(25) Search Engines as TV Camera Crews

Search engines are like a TV camera crew let loose in the middle of a crowd of rowdy fans after a game. Seeing the camera, everyone acts boorishly and jostles to get in front. The act of observing something changes it.

Lee Gomes, "Our Columnist Creates Web 'Original Content' But Is in for a Surprise" [http://online.wsj.com/public/article/SB114116587424585798.html], Wall Street Journal, 2006

(26) The Rich Get Richer?

(27) Homogenization & Hegemony

(28) Personalized Search

(29) Search and Policy

2010-03-29 Web Architecture and Information Management [./]