Generalized Stochastic Tree Automata for Multi-relational Data Mining

This paper addresses the problem of learning a statistical distribution of data in a relational database. The data of interest are represented as trees, a natural way to represent structured information. These trees are then used to infer a stochastic tree automaton with a well-known grammatical inference algorithm. We propose two extensions of this algorithm: the use of sorts and the generalization of the inferred automaton according to a local criterion. Experiments show that our approach scales to large databases and improves both the predictive power of the learned model and the convergence of the learning algorithm.
