Effective Web Data Extraction with Standard XML Technologies

Jussi Myllymaki

Jussi Myllymaki, Effective Web Data Extraction with Standard XML Technologies, Computer Networks, 39(5):635-644, August 2002.

We describe an Extensible Markup Language (XML)-based methodology for Web data extraction that extends beyond simple "screen scraping". An ideal data extraction process can digest target Web databases that are visible only as Hypertext Markup Language (HTML) pages, and create a local replica of those databases as a result. What is needed is more than a Web crawler and a set of Web site wrappers. A comprehensive data extraction process must deal with such obstacles as session identifiers, HTML forms, client-side JavaScript, incompatible datasets and vocabularies, and missing and conflicting data. Proper data extraction also requires solid data validation and error recovery to handle data extraction failures. Our ANDES software framework helps solve these problems and provides a platform for building a production-quality Web data extraction process. Key aspects of ANDES are that it uses XML technologies for data extraction, including Extensible HTML and Extensible Stylesheet Language Transformations, and provides access to the "deep Web".
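
As a rough illustration of the general idea (not the ANDES implementation, whose internals are not described here), the sketch below treats a page that has already been converted to well-formed XHTML as ordinary XML, pulls records out with path expressions, and sets aside rows that fail a simple validation step. The page content, the `products` table id, and the `extract_products` function are all hypothetical. The abstract names Extensible Stylesheet Language Transformations as the extraction mechanism; this sketch substitutes the path queries of Python's standard `xml.etree.ElementTree` module only to stay dependency-free.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to have been converted from HTML to
# well-formed XHTML before extraction.
XHTML_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table id="products">
      <tr><th>Name</th><th>Price</th></tr>
      <tr><td>Widget</td><td>9.99</td></tr>
      <tr><td>Gadget</td><td>n/a</td></tr>
      <tr><td>Gizmo</td><td>24.50</td></tr>
    </table>
  </body>
</html>"""

# XHTML elements live in this namespace, so queries must be qualified.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_products(xhtml_text):
    """Return (records, rejected) extracted from the hypothetical table."""
    root = ET.fromstring(xhtml_text)
    records, rejected = [], []
    for row in root.findall(".//x:table[@id='products']/x:tr", NS):
        cells = [(td.text or "").strip() for td in row.findall("x:td", NS)]
        if len(cells) != 2:
            continue  # header row (th cells only) or unexpected layout
        name, price_text = cells
        try:
            records.append({"name": name, "price": float(price_text)})
        except ValueError:
            # Validation failure: keep the raw row for later inspection
            # instead of silently dropping it.
            rejected.append(cells)
    return records, rejected


records, rejected = extract_products(XHTML_PAGE)
print(records)
print(rejected)
```

Keeping rejected rows separate from accepted records is one simple way to make extraction failures visible, in the spirit of the data validation and error recovery the abstract calls for.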


