Interactive Learning of Node Selecting Tree Transducers ⋆

Julien Carme et al.

We develop new algorithms for learning monadic node selection queries in unranked trees from annotated examples, and apply them to visually interactive Web information extraction. We propose to represent monadic queries by bottom-up deterministic Node Selecting Tree Transducers (Nstts), a particular class of tree automata that we introduce. We prove that deterministic Nstts capture the class of queries definable in monadic second order logic (Mso) in trees, which Gottlob and Koch (2002) argue has the right expressiveness for Web information extraction, and we prove that monadic queries defined by Nstts can be answered efficiently. We present a new polynomial time algorithm in the style of Rpni that learns monadic queries defined by deterministic Nstts from completely annotated examples, in which all selected nodes are distinguished. In practice, users prefer to provide partial annotations. We propose to account for partial annotations by intelligent tree pruning heuristics. We introduce pruning Nstts, a formalism that shares many advantages of Nstts. This leads us to an interactive learning algorithm for monadic queries defined by pruning Nstts, which satisfies a new formal active learning model in the style of Angluin (1987). We have implemented our interactive learning algorithm and integrated it into a visually interactive Web information extraction system, called Squirrel, by plugging it into the Mozilla Web browser. Experiments on realistic Web documents confirm excellent quality with very few user interactions during wrapper induction.

⋆ A previous version of this article was published in Machine Learning 66,1 (2007) 33–67.
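To make the idea of representing a monadic query by a tree automaton over annotated trees more concrete, the following is a minimal sketch, not the construction from the article. It assumes a toy alphabet {a, b} and a hypothetical query "select every b-labeled leaf". The automaton reads each node together with a Boolean annotation and evaluates unranked trees child by child, in the spirit of stepwise tree automata [30]. The tables `INIT`, `STEP`, `FINAL` and the functions `run` and `select` are illustrative names. Query answering is done here by brute-force enumeration of all annotations, which is exponential and is only meant to show the semantics, not the efficient evaluation mentioned above.

```python
from itertools import product

# A tree is a pair (label, [children]); nodes are numbered in pre-order.
# Hypothetical toy automaton over (label, selected?) pairs:
#   a selected b-node must be a leaf, an unselected b-node needs a child,
#   and a-nodes are never selected.
INIT = {("a", False): "ok", ("b", True): "sel_leaf", ("b", False): "b_pending"}
# Stepwise rule q @ child_state -> q': absorb one child at a time.
STEP = {(q, c): "ok" for q in ("ok", "b_pending") for c in ("ok", "sel_leaf")}
FINAL = {"ok", "sel_leaf"}

def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])

def run(tree, bits, pos=0):
    """Deterministic bottom-up run on the tree annotated by bits.

    Returns (state, next_pos); state is None when no rule applies.
    """
    label, children = tree
    state = INIT.get((label, bits[pos]))
    pos += 1
    for child in children:
        child_state, pos = run(child, bits, pos)
        state = STEP.get((state, child_state))  # None propagates upward
    return state, pos

def select(tree):
    """Answer the query: pre-order positions selected by the accepted annotation."""
    accepted = [bits for bits in product([False, True], repeat=size(tree))
                if run(tree, bits)[0] in FINAL]
    # The recognized language must be functional: at most one annotation per tree.
    assert len(accepted) <= 1, "language of annotated trees is not functional"
    return [i for i, b in enumerate(accepted[0]) if b] if accepted else []

# Usage: nodes in pre-order are a(0), b(1), b(2), a(3), b(4).
example = ("a", [("b", []), ("b", [("a", []), ("b", [])])])
print(select(example))  # [1, 4]
```

The design point illustrated is that the automaton itself never outputs anything: the query is read off from the unique annotation under which the tree is accepted, so determinism of the automaton and functionality of the recognized language are separate requirements.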

[1]  E. Mark Gold,et al.  Language Identification in the Limit , 1967, Inf. Control..

[2]  James W. Thatcher,et al.  Characterizing Derivation Trees of Context-Free Grammars through a Generalization of Finite Automata Theory , 1967, J. Comput. Syst. Sci..

[3]  E. Mark Gold,et al.  Complexity of Automaton Identification from Given Data , 1978, Inf. Control..

[4]  Dana Angluin,et al.  Learning Regular Sets from Queries and Counterexamples , 1987, Inf. Comput..

[5]  Yasubumi Sakakibara,et al.  Learning context-free grammars from structural data in polynomial time , 1988, COLT '88.

[6]  J. Oncina,et al.  Inferring Regular Languages in Polynomial Updated Time , 1992 .

[7]  Kevin J. Lang.  Random DFA's can be approximately learned from sparse uniform examples , 1992, COLT '92.

[8]  Hubert Comon,et al.  Tree automata techniques and applications , 1997 .

[9]  Nicholas Kushmerick,et al.  Wrapper Induction for Information Extraction , 1997, IJCAI.

[10]  Helmut Seidl,et al.  Locating Matches of Tree Patterns in Forests , 1998, FSTTCS.

[11]  Frank Neven,et al.  Expressiveness of structured document query languages based on attribute grammars , 1998, JACM.

[12]  Barak A. Pearlmutter,et al.  Results of the Abbadingo One DFA Learning Competition and a New Evidence-Driven State Merging Algorithm , 1998, ICGI.

[13]  Chun-Nan Hsu,et al.  Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web , 1998, Inf. Syst..

[14]  Andrew McCallum,et al.  Information Extraction with HMMs and Shrinkage , 1999 .

[15]  Nicholas Kushmerick,et al.  Wrapper induction: Efficiency and expressiveness , 2000, Artif. Intell..

[16]  Dayne Freitag,et al.  Boosted Wrapper Induction , 2000, AAAI/IAAI.

[17]  Georg Gottlob,et al.  Visual Web Information Extraction with Lixto , 2001, VLDB.

[18]  Derick Wood,et al.  Regular tree and regular hedge languages over unranked alphabets , 2001 .

[19]  Boris Chidlovskii,et al.  Wrapping Web Information Providers by Transducer Induction , 2001, ECML.

[20]  Nicholas Kushmerick,et al.  Finite-State Approaches to Web Information Extraction , 2002, SCIE.

[21]  William W. Cohen,et al.  A flexible learning system for wrapping tables and lists in HTML documents , 2002, WWW.

[22]  Georg Gottlob,et al.  Monadic queries over tree-structured data , 2002, Proceedings 17th Annual IEEE Symposium on Logic in Computer Science.

[23]  Thomas Schwentick,et al.  Query automata over finite trees , 2002, Theor. Comput. Sci..

[24]  Frank Drewes,et al.  Learning a Regular Tree Language from a Teacher , 2003, Developments in Language Theory.

[25]  J. Oncina.  Inference of recognizable tree sets , 2003 .

[26]  Christoph Koch,et al.  Query evaluation on compressed trees , 2003, 18th Annual IEEE Symposium of Logic in Computer Science, 2003. Proceedings..

[27]  Maurice Bruynooghe,et al.  Information Extraction from Web Documents Based on Local Unranked Tree Automaton Inference , 2003, IJCAI.

[28]  Joachim Niehren,et al.  Learning Node Selecting Tree Transducer from Completely Annotated Examples , 2004, ICGI.

[29]  Craig A. Knoblock,et al.  Hierarchical Wrapper Induction for Semistructured Information Sources , 2004, Autonomous Agents and Multi-Agent Systems.

[30]  Joachim Niehren,et al.  Querying Unranked Trees with Stepwise Tree Automata , 2004, RTA.

[31]  Colin de la Higuera,et al.  Characteristic Sets for Polynomial Grammatical Inference , 1997, Machine Learning.

[32]  Helmut Seidl.  On the finite degree of ambiguity of finite tree automata , 2004, Acta Informatica.

[33]  James W. Thatcher,et al.  Generalized finite automata theory with an application to a decision problem of second-order logic , 1968, Mathematical systems theory.

[34]  Maurice Bruynooghe,et al.  Learning (k, l)-Contextual Tree Languages for Information Extraction , 2005, ECML.

[35]  Christof Löding,et al.  Deterministic Automata on Unranked Trees , 2005, FCT.

[36]  Joachim Niehren,et al.  Tree Automata , 2005 .

[37]  Joachim Niehren,et al.  Interactive learning of node selecting tree transducer , 2006, Machine Learning.

[38]  Leonid Libkin,et al.  Logics for Unranked Trees: An Overview , 2005, Log. Methods Comput. Sci..

[39]  Joachim Niehren,et al.  On the minimization of XML Schemas and tree automata for unranked trees , 2007, J. Comput. Syst. Sci..