Efficient Techniques for Effective Wrapper Induction

Several recent studies have focused on generating wrappers that extract data from Web data sources. The ROADRUNNER system aims to automate the tedious and expensive process of writing wrappers in an unsupervised, domain-independent, and scalable manner. The system is based on a grammar inference algorithm, called MATCH, which was designed within a sound theoretical framework. However, in its original definition MATCH lacks expressivity: when run over real-life Web pages, it often fails to produce a solution. In this paper we address the challenging issue of developing techniques that allow us to build an effective and efficient system on top of MATCH without abandoning its original formal foundations. First, we analyze the main limitations of MATCH; then we illustrate the techniques we have developed to overcome them. Finally, we report the results of experiments that show the efficacy of the introduced techniques and demonstrate the improvements to the overall system.
