Web wrapper induction: a brief survey

Nowadays several companies use the information available on the Web for a number of purposes. However, since most of this information is only available as HTML documents, several techniques that allow information from the Web to be automatically extracted have recently been defined. In this paper we review the main techniques and tools for extracting information available on the Web, devising a taxonomy of existing systems. In particular we emphasize the advantages and drawbacks of the techniques analyzed from a user point of view.

[1]  Georg Gottlob,et al.  Visual Web Information Extraction with Lixto , 2001, VLDB.

[2]  Hector Garcia-Molina,et al.  Semistructured Data: The Tsimmis Experience , 1997, ADBIS.

[3]  Valter Crescenzi,et al.  Grammars Have Exceptions , 1998, Inf. Syst..

[4]  Dayne Freitag,et al.  Machine Learning for Information Extraction in Informal Domains , 2000, Machine Learning.

[5]  Nicholas Kushmerick,et al.  Wrapper induction: Efficiency and expressiveness , 2000, Artif. Intell..

[6]  Stephen Soderland,et al.  Learning Information Extraction Rules for Semi-Structured and Free Text , 1999, Machine Learning.

[7]  Alberto O. Mendelzon,et al.  WebOQL: restructuring documents, databases and Webs , 1998, Proceedings 14th International Conference on Data Engineering.

[8]  Berthier A. Ribeiro-Neto,et al.  A brief survey of web data extraction tools , 2002, SGMD.

[9]  Arnaud Sahuguet,et al.  Building intelligent Web applications using lightweight wrappers , 2001, Data Knowl. Eng..

[10]  Brad Adelberg,et al.  NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents , 1998, SIGMOD '98.

[11]  Craig A. Knoblock,et al.  Hierarchical Wrapper Induction for Semistructured Information Sources , 2004, Autonomous Agents and Multi-Agent Systems.

[12]  Raymond J. Mooney,et al.  Relational Learning of Pattern-Match Rules for Information Extraction , 1999, CoNLL.

[13]  Brad Adelberg,et al.  NoDoSE - A Tool for Semi-Automatically Extracting Semi-Structured Data from Text Documents , 1998, SIGMOD Conference.

[14]  Valter Crescenzi,et al.  RoadRunner: Towards Automatic Data Extraction from Large Web Sites , 2001, VLDB.

[15]  David W. Embley,et al.  Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages , 1999, Data Knowl. Eng..