Design of an RSS Crawler with Adaptive Revisit Manager [ Bum-Suk Lee, Jin Woo Im, Byung-Yeon Hwang, Du Zhang ]

Abstract:

RSS (Rich Site Summary, or Really Simple Syndication) is widely used for notifying readers of updated information on blogs and feeding news to readers quickly. RSS is very simple, and so is mostly used as a web service. However there is no satisfactory search engine which works for RSS. The reason is that RSS is continuously modified, and the structure of general search engines is ineffective to collect information from RSS sources. In this paper, we discuss a web crawling algorithm, and propose a structure for an RSS crawler which is geared toward collecting and updating RSS in the Web 2.0 environment. The proposed method (1) uses visited domain name history to predict the location of the RSS of a new seed URL, and (2) updates RSS information adaptively, based on some update-checking heuristics. These approaches can serve as cornerstones for an efficient and effective RSS search engine.