Information Extraction from Web Documents Based on Local Unranked Tree Automaton Inference

Information extraction (IE) aims to extract specific information from a collection of documents. Much previous work on IE from semi-structured documents (XML or HTML) uses learning techniques based on strings. Some recent work converts the document to a ranked tree and applies tree automaton induction. This paper introduces an algorithm that instead induces an automaton from unranked trees. Experiments show that this gives the best results obtained so far for learning-based IE from semi-structured documents.
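To make the ranked/unranked distinction concrete, the following minimal Python sketch parses a small HTML fragment into an unranked tree (each node may have any number of children) and then converts it to a binary, hence ranked, tree using the common first-child/next-sibling encoding. This is an illustration only: the names `Node`, `BinNode`, `parse_unranked`, and `to_binary` are hypothetical, the encoding is one standard choice and not necessarily the one used in the earlier work the abstract refers to, and no automaton induction is performed.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional


@dataclass
class Node:
    """Unranked tree node: any number of ordered children."""
    label: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class BinNode:
    """Ranked (binary) node: left = first child, right = next sibling."""
    label: str
    left: Optional["BinNode"] = None
    right: Optional["BinNode"] = None


class _TreeBuilder(HTMLParser):
    """Builds an unranked tree; assumes reasonably well-formed markup."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag name.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Text content is abstracted to a single "#text" leaf.
        if data.strip():
            self.stack[-1].children.append(Node("#text"))


def parse_unranked(html: str) -> Node:
    builder = _TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def to_binary(siblings: List[Node]) -> Optional[BinNode]:
    """First-child/next-sibling encoding of a sibling sequence."""
    if not siblings:
        return None
    head, rest = siblings[0], siblings[1:]
    return BinNode(head.label, to_binary(head.children), to_binary(rest))


if __name__ == "__main__":
    root = parse_unranked("<tr><td>a</td><td>b</td><td>c</td></tr>")
    row = root.children[0]
    print(row.label, [c.label for c in row.children])
    binary = to_binary(root.children)
    # In the binary tree the third cell is reached via two sibling links.
    print(binary.left.right.right.label)
```

In the unranked tree the three `td` cells are direct children of `tr`; in the binary encoding only the first cell is attached to `tr`, and the others hang off a chain of sibling links, so the parent-child relation is no longer local.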
