Crawling the Web [ Gautam Pant, Padmini Srinivasan, Filippo Menczer ]

Citation

Gautam Pant, Padmini Srinivasan, Filippo Menczer, Crawling the Web, pp. 153-178, Mark Levene, Alexandra Poulovassilis (Eds.), Web Dynamics: Adapting to Change in Content, Size, Topology and Use, Springer-Verlag, Berlin, Germany, November 2004, 978-3-540-40676-1.

Descriptions

Abstract:

The large size and the dynamic nature of the Web highlight the need for continuous support and updating of Web-based information retrieval systems. Crawlers facilitate the process by following the hyperlinks in Web pages to automatically download a partial snapshot of the Web. While some systems rely on crawlers that exhaustively crawl the Web, others incorporate "focus" within their crawlers to harvest application- or topic-specific collections. We discuss the basic issues related to developing a crawling infrastructure. This is followed by a review of several topical crawling algorithms, and evaluation metrics that may be used to judge their performance. While many innovative applications of Web crawling are still being invented, we take a brief look at some developed in the past.

Resources

Google:	Search for `["Crawling the Web" Gautam Pant Padmini Srinivasan Filippo Menczer]`
URI:	`http://dollar.biz.uiowa.edu/~pant/Papers/crawling.pdf`