Robust Web Data Extraction with XML Path Expressions

Automated extraction of structured Web data has attracted considerable interest in both academia and industry. A particularly promising approach is to employ XML technologies to translate semi-structured HTML documents into “pure” XML documents. In this approach, HTML documents are first normalized into XHTML and then mapped to the desired XML application format using XML path expressions and regular expressions. In this paper we describe a methodology for creating XML path (XPath) expressions that can extract data from virtually any HTML page, with an emphasis on keeping these expressions valid over time. This robustness is critical because extraction technologies are vulnerable to the continually changing content, structure, and formatting of pages on the Web. We define categories of extraction rules in terms of their dependence on content, structural, or formatting features, and provide practical tips on how to create dependable data extraction patterns for the Web.
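As a rough illustration of the idea, and not the method described in the paper, the following Python sketch anchors an extraction rule on content (a header cell labeled "Price") instead of on absolute position, then cleans the cell text with a regular expression. The sample pages, the column labels, and the function name `extract_prices` are all hypothetical. The input is assumed to be already normalized into well-formed XHTML, and the XHTML namespace is omitted for brevity. Only the limited XPath subset of the standard library's `xml.etree.ElementTree` is used.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to be already normalized into well-formed XHTML.
PAGE_V1 = """<html><body>
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Widget</td><td>USD 19.99</td></tr>
  <tr><td>Gadget</td><td>USD 5.00</td></tr>
</table>
</body></html>"""

# Same data after a hypothetical redesign: a navigation table was added
# before the data table and a "Stock" column was inserted.
PAGE_V2 = """<html><body>
<table><tr><td>Home</td><td>About</td></tr></table>
<table>
  <tr><th>Item</th><th>Stock</th><th>Price</th></tr>
  <tr><td>Widget</td><td>12</td><td>USD 19.99</td></tr>
  <tr><td>Gadget</td><td>3</td><td>USD 5.00</td></tr>
</table>
</body></html>"""

# Regular expression that pulls the numeric part out of a price cell.
PRICE_RE = re.compile(r"\d+(?:\.\d+)?")


def extract_prices(xhtml):
    """Return (item, price) pairs from any table with a 'Price' header."""
    root = ET.fromstring(xhtml)
    results = []
    for table in root.iter("table"):
        # Content-based rule: locate the header row by its label,
        # not by the table's position in the page.
        header = table.find("tr[th='Price']")
        if header is None:
            continue
        labels = [(th.text or "").strip() for th in header.findall("th")]
        if "Item" not in labels:
            continue
        item_col = labels.index("Item")
        price_col = labels.index("Price")
        for row in table.findall("tr"):
            cells = row.findall("td")
            if len(cells) <= max(item_col, price_col):
                continue  # header row or malformed row
            match = PRICE_RE.search(cells[price_col].text or "")
            if match:
                item = (cells[item_col].text or "").strip()
                results.append((item, float(match.group())))
    return results


if __name__ == "__main__":
    print(extract_prices(PAGE_V1))
    print(extract_prices(PAGE_V2))
```

A purely positional rule such as "second cell of each row in the first table" would return the wrong values for the second page, whereas the label-anchored rule yields the same pairs for both. The trade-off is that the rule now depends on the header text staying the same.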
