Automatic information extraction from large websites

Information extraction from websites is a relevant problem today, and it is usually performed by software modules called wrappers. A key requirement is that the wrapper generation process be automated to the largest possible extent, so that large-scale extraction tasks remain feasible even in the presence of changes in the underlying sites. So far, however, only semi-automatic proposals have appeared in the literature.

We present a novel approach to information extraction from websites, which reconciles recent proposals for supervised wrapper induction with the more traditional field of grammar inference. Grammar inference provides a promising theoretical framework for the study of unsupervised (that is, fully automatic) wrapper generation algorithms. However, because they make some unrealistic assumptions about the input, these algorithms are not practically applicable to Web information extraction tasks.

The main contributions of the article lie in the definition of a class of regular languages, called the prefix mark-up languages, that abstract the structures usually found in HTML pages, and in the definition of a polynomial-time unsupervised learning algorithm for this class. The article shows that, unlike other known classes, prefix mark-up languages and the associated algorithm can be used in practice for information extraction.

A system based on the techniques described in the article has been implemented as a working prototype. We present some experimental results on well-known websites and discuss the opportunities and limitations of the proposed approach.
