A flexible learning system for wrapping tables and lists in HTML documents

A program that makes an existing website look like a database is called a wrapper. Wrapper learning is the problem of learning website wrappers from examples. We present a wrapper-learning system called WL2 that can exploit several different representations of a document. These representations include DOM-level and token-level representations, as well as two-dimensional geometric views of the rendered page (for tabular data) and representations of the visual appearance of text as it will be rendered. Additionally, the learning system is modular, and can be easily adapted to new domains and tasks. The learning system described is part of an "industrial-strength" wrapper management system that is in active use at WhizBang Labs. Controlled experiments show that the learner has broader coverage and a faster learning rate than earlier wrapper-learning systems.
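To make the distinction between document representations concrete, the following minimal sketch builds a DOM-level view and a token-level view of the same HTML fragment. It is an illustration only and is not taken from WL2: the names `HTML`, `PathRecorder`, `dom_view`, and `token_view`, as well as the sample table, are hypothetical, and the tokenizer is a deliberately crude regular expression.

```python
from html.parser import HTMLParser
import re

# Hypothetical sample page fragment (not from the paper).
HTML = ("<table><tr><td>Widget</td><td>$5</td></tr>"
        "<tr><td>Gadget</td><td>$7</td></tr></table>")


class PathRecorder(HTMLParser):
    """DOM-level view: pair each text node with the path of enclosing tags."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # (tag path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Assumes well-nested input; mismatched end tags are ignored.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def dom_view(html):
    """Return the text nodes of `html`, each with its tag path."""
    parser = PathRecorder()
    parser.feed(html)
    parser.close()
    return parser.nodes


def token_view(html):
    """Token-level view: a flat sequence of tags, words, and punctuation."""
    return re.findall(r"</?\w+[^>]*>|\w+|[^\w\s]", html)


if __name__ == "__main__":
    for path, text in dom_view(HTML):
        print(path, text)
    print(token_view(HTML)[:9])
```

In the DOM-level view all four cells share the path `table/tr/td`, which is the kind of regularity a structural extraction rule can exploit. In the token-level view the same cells appear only as words delimited by neighbouring tag tokens, which suits rules based on left and right context. A learner that can use both views can pick whichever describes a given site more simply.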
