Learning twig and path queries

We investigate the problem of learning XML queries, specifically path queries and twig queries, from examples given by the user. A learning algorithm takes as input a set of XML documents with nodes annotated by the user and returns a query that selects nodes in a manner consistent with the annotation. We study two learning settings that differ in the types of annotations allowed. In the first setting, the user may only indicate required nodes that the query must select (i.e., positive examples). In the second, more general setting, the user may also indicate forbidden nodes that the query must not select (i.e., negative examples). Nodes without any annotation may or may not be selected by the query. We formalize what it means for a class of queries to be learnable. One requirement is the existence of a learning algorithm that is sound, i.e., one that always returns a query consistent with the examples given by the user. Furthermore, the learning algorithm should be complete, i.e., able to produce every query when given sufficiently rich examples. Other requirements involve the tractability of the learning algorithm and its robustness to nonessential examples. We identify practical classes of Boolean and unary path and twig queries that are learnable from positive examples. We also show that adding negative examples renders learning infeasible.
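The notion of consistency with an annotation can be illustrated with a minimal sketch. It is not the learning algorithm of the paper: it only checks whether a given candidate query agrees with positive and negative examples. The helper name `is_consistent`, the sample document, and the use of the limited XPath subset of Python's `xml.etree.ElementTree` as a stand-in for path queries are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_consistent(root, query, positive, negative=()):
    """Hypothetical helper: check a candidate query against an annotation.

    The query is consistent if it selects every positive (required) node
    and no negative (forbidden) node. Unannotated nodes are unconstrained.
    """
    selected = set(root.findall(query))
    selects_all_required = all(node in selected for node in positive)
    selects_no_forbidden = all(node not in selected for node in negative)
    return selects_all_required and selects_no_forbidden


# Illustrative document (assumed, not taken from the paper).
doc = ET.fromstring(
    "<library>"
    "<book><title>A</title></book>"
    "<journal><title>B</title></journal>"
    "</library>"
)
book_title = doc.find("book/title")
journal_title = doc.find("journal/title")

# Positive examples only: a very general query is already consistent.
only_positive = is_consistent(doc, ".//title", [book_title])

# Adding a negative example rules the general query out ...
general_with_negative = is_consistent(
    doc, ".//title", [book_title], [journal_title]
)

# ... while a more specific path query remains consistent.
specific_with_negative = is_consistent(
    doc, "./book/title", [book_title], [journal_title]
)
```

In this sketch the second setting is visibly more restrictive: the query `.//title` agrees with the positive example alone but is rejected once the journal title is marked as forbidden.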
