On Automatic Information Extraction from Large Web Sites
Roma, Italy

Information extraction from Web sites is an increasingly relevant problem, and it is usually performed by software modules called wrappers. A key requirement is that the wrapper generation process be automated to the largest possible extent, so that large-scale extraction tasks remain feasible even in the presence of changes in the underlying sites. So far, however, only semi-automatic proposals have appeared in the literature. We present a novel approach to information extraction from Web sites, which reconciles recent proposals for supervised wrapper induction with the more traditional field of grammar inference. Grammar inference provides a promising theoretical framework for the study of unsupervised (i.e., fully automatic) wrapper generation algorithms. However, because they rely on some unrealistic assumptions about the input, existing grammar inference algorithms are not practically applicable to Web information extraction tasks. The main contributions of the paper are the definition of a class of regular languages, called the prefix mark-up languages, that abstracts the structures usually found in HTML pages, and the definition of a polynomial-time unsupervised learning algorithm for this class. The paper shows that, unlike other known classes, prefix mark-up languages and the associated algorithm can be used in practice for information extraction. A system based on the techniques described in the paper has been implemented as a working prototype. Experiments on known Web sites further demonstrate the practical applicability of the proposed approach.
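To make the notion of a wrapper concrete, the following minimal sketch shows a hand-written wrapper expressed as a regular expression over HTML mark-up. It is not the learning algorithm of the paper, and it does not claim that the pattern belongs to the class of prefix mark-up languages; the sample pages, the pattern, and the names `PAGES`, `WRAPPER`, `ITEM`, and `extract` are assumptions introduced purely for illustration. An unsupervised approach would aim to infer a pattern of this kind from the pages alone, without the manual step shown here.

```python
import re

# Two hypothetical pages produced by the same template (illustrative data only).
PAGES = [
    "<html><b>Title:</b> Dune <ul><li>Herbert</li></ul></html>",
    "<html><b>Title:</b> Emma <ul><li>Austen</li><li>Ed.</li></ul></html>",
]

# Hand-written wrapper: a regular expression over the mark-up.
# Fixed tags act as delimiters, capture groups mark the data fields,
# and the (...)+ group models a repeated list item.
WRAPPER = re.compile(
    r"<html><b>Title:</b> (?P<title>[^<]+) "
    r"<ul>(?P<items>(?:<li>[^<]+</li>)+)</ul></html>"
)
ITEM = re.compile(r"<li>([^<]+)</li>")


def extract(page):
    """Return the data fields of a page, or None if the page does not fit the template."""
    match = WRAPPER.fullmatch(page)
    if match is None:
        return None
    return {
        "title": match.group("title"),
        "items": ITEM.findall(match.group("items")),
    }


if __name__ == "__main__":
    for page in PAGES:
        print(extract(page))
```

Writing such a pattern by hand for every site, and rewriting it whenever the site layout changes, is the manual effort that automatic wrapper generation is meant to remove.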
