Crawling the Web

Gautam Pant, Padmini Srinivasan, Filippo Menczer


The large size and the dynamic nature of the Web highlight the need for continuous support and updating of Web-based information retrieval systems. Crawlers facilitate this process by following the hyperlinks in Web pages to automatically download a partial snapshot of the Web. While some systems rely on crawlers that exhaustively crawl the Web, others incorporate "focus" within their crawlers to harvest application- or topic-specific collections. We discuss the basic issues related to developing a crawling infrastructure. This is followed by a review of several topical crawling algorithms and of the evaluation metrics that may be used to judge their performance. While many innovative applications of Web crawling are still being invented, we take a brief look at some developed in the past.

