Information extraction from structured documents using k-testable tree automaton inference

Information extraction (IE) addresses the problem of extracting specific information from a collection of documents. Much of the previous work on IE from structured documents, such as HTML or XML, uses string-based learning techniques such as finite automata induction. These methods do not exploit the tree structure of the documents. A natural way to exploit that structure is to induce tree automata, which resemble finite-state automata but process trees instead of strings. In this work, we explore the induction of k-testable ranked tree automata from a small set of annotated examples. We describe three variants, which differ in the way they generalize the inferred automaton. Experimental results on a set of benchmark data sets show that our approach compares favorably to string-based approaches. However, the quality of the extraction is still suboptimal.
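
To illustrate how a tree automaton differs from a string automaton, the following is a minimal sketch of a deterministic bottom-up automaton over ranked trees. It is not the inference algorithm described in this work: the transition table is written by hand rather than induced, and the names `run`, `accepts`, `DELTA`, and `FINAL`, as well as the toy `table`/`row` labels, are assumptions made for the example.

```python
from typing import Dict, Optional, Set, Tuple

# A ranked tree is a pair (label, children), where children is a tuple of trees.
Tree = Tuple[str, tuple]
# A transition maps (label, states of the children) to the state of the node.
Transitions = Dict[Tuple[str, Tuple[str, ...]], str]


def run(tree: Tree, delta: Transitions) -> Optional[str]:
    """Return the state reached at the root, or None if no transition applies."""
    label, children = tree
    child_states = []
    for child in children:
        state = run(child, delta)
        if state is None:
            return None  # a subtree was rejected, so the whole tree is rejected
        child_states.append(state)
    return delta.get((label, tuple(child_states)))


def accepts(tree: Tree, delta: Transitions, final_states: Set[str]) -> bool:
    """A tree is accepted when the state at its root is a final state."""
    return run(tree, delta) in final_states


# Hand-written (hypothetical) automaton for a toy document with two rows,
# each row holding a name followed by a price.
DELTA: Transitions = {
    ("name", ()): "q_name",
    ("price", ()): "q_price",
    ("row", ("q_name", "q_price")): "q_row",
    ("table", ("q_row", "q_row")): "q_table",
}
FINAL = {"q_table"}

name, price = ("name", ()), ("price", ())
good_doc = ("table", (("row", (name, price)), ("row", (name, price))))
bad_doc = ("table", (("row", (price, name)), ("row", (name, price))))

print(accepts(good_doc, DELTA, FINAL))
print(accepts(bad_doc, DELTA, FINAL))
```

The state assigned to a node depends on its label and on the states of all its children, which is how the automaton captures the nesting that a string-based automaton reading a flattened document would not see directly.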
