Effective Web Data Extraction with Standard XML Technologies [ Jussi Myllymaki ]

Abstract:

We describe an Extensible Markup Language (XML)-based methodology for Web data extraction that goes beyond simple "screen scraping". An ideal data extraction process can digest target Web databases that are visible only as Hypertext Markup Language (HTML) pages and create a local replica of those databases. This requires more than a Web crawler and a set of Web site wrappers. A comprehensive data extraction process must deal with obstacles such as session identifiers, HTML forms, client-side JavaScript, incompatible datasets and vocabularies, and missing and conflicting data. It also requires solid data validation and error recovery to handle extraction failures. Our ANDES software framework helps solve these problems and provides a platform for building a production-quality Web data extraction process. Key aspects of ANDES are that it uses XML technologies for data extraction, including Extensible HTML (XHTML) and Extensible Stylesheet Language Transformations (XSLT), and that it provides access to the "deep Web".
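
To make the idea of extracting data from Extensible HTML with XML tooling concrete, the following is a minimal sketch and not the ANDES implementation. ANDES is described as using Extensible Stylesheet Language Transformations; because the Python standard library has no XSLT processor, this sketch substitutes path queries from `xml.etree.ElementTree` as a stand-in for a stylesheet. The sample page, the `listings` table id, the `extract_rows` function, and the `records`/`record`/`title`/`price` output vocabulary are all hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace prefix mapping for well-formed XHTML input.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical page, assumed to be already converted to well-formed XHTML.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table id="listings">
      <tr><th>Title</th><th>Price</th></tr>
      <tr><td>Widget</td><td>9.99</td></tr>
      <tr><td>Gadget</td><td>24.50</td></tr>
    </table>
  </body>
</html>"""


def extract_rows(xhtml_text):
    """Map table rows of an XHTML page to a small XML record set."""
    root = ET.fromstring(xhtml_text)
    records = ET.Element("records")
    # Select every row of the (hypothetical) listings table.
    for row in root.findall(".//x:table[@id='listings']/x:tr", XHTML_NS):
        cells = row.findall("x:td", XHTML_NS)
        if len(cells) != 2:
            continue  # header rows use <th>, so they yield no <td> cells
        record = ET.SubElement(records, "record")
        ET.SubElement(record, "title").text = (cells[0].text or "").strip()
        ET.SubElement(record, "price").text = (cells[1].text or "").strip()
    return records


if __name__ == "__main__":
    print(ET.tostring(extract_rows(PAGE), encoding="unicode"))
```

The point of the sketch is the shape of the process: once a page is well-formed XML, extraction becomes a declarative mapping from paths in the input to elements in an output vocabulary, which is the role a stylesheet would play in an XSLT-based pipeline.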