Constrained mining of patterns in large databases

A theoretical framework is introduced that models data mining problems as the answering of queries in inductive databases. Inductive queries are requests to find patterns in a database that satisfy user-specified constraints. Analysis of the answer sets to inductive queries, which are composed from anti-monotonic and monotonic basic predicates using Boolean operators, reveals properties such as "dimension" that are useful for query optimization. The concept of version spaces is extended to "generalized version spaces" to encapsulate such answer sets. Generalized version spaces are closed under the usual set operations, providing a closure property akin to that of relational algebra. This generic framework has been applied to several application domains, and algorithms and optimization techniques have been devised that use the theoretical results to answer queries to inductive databases efficiently. Experiments show that these techniques are applicable.
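As an illustration of the kind of query described above, the following minimal sketch evaluates the conjunction of one anti-monotonic and one monotonic predicate over itemset patterns and summarizes the answer set by its most general and most specific elements, as a version space does. The toy transactions, the threshold of 2, and all function names are assumptions made for this example. The exhaustive enumeration is only for clarity and is not one of the algorithms devised in this work.

```python
from itertools import combinations

# Hypothetical toy database: each transaction is a set of items.
TRANSACTIONS = [frozenset("abc"), frozenset("abd"), frozenset("ab"), frozenset("cd")]
ITEMS = frozenset("abcd")


def frequency(pattern, transactions):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in transactions if pattern <= t)


def all_patterns(items):
    """Enumerate every subset of items (exponential; fine for a toy example)."""
    ordered = sorted(items)
    for size in range(len(ordered) + 1):
        for combo in combinations(ordered, size):
            yield frozenset(combo)


def min_freq(pattern):
    """Anti-monotonic: if a pattern is frequent, so is every subset of it."""
    return frequency(pattern, TRANSACTIONS) >= 2


def contains_a(pattern):
    """Monotonic: if a pattern contains 'a', so does every superset of it."""
    return "a" in pattern


def solve(query, items):
    """Answer set of an inductive query, by brute-force enumeration."""
    return {p for p in all_patterns(items) if query(p)}


def borders(solutions):
    """Most general (minimal) and most specific (maximal) solutions."""
    general = {p for p in solutions if not any(q < p for q in solutions)}
    specific = {p for p in solutions if not any(p < q for q in solutions)}
    return general, specific


# Conjunction of an anti-monotonic and a monotonic predicate.
answers = solve(lambda p: min_freq(p) and contains_a(p), ITEMS)
general, specific = borders(answers)

print(sorted("".join(sorted(p)) for p in answers))
print(sorted("".join(sorted(p)) for p in general))
print(sorted("".join(sorted(p)) for p in specific))
```

In this sketch the answer set of the conjunction lies between the two borders: every pattern that is a superset of some element of `general` and a subset of some element of `specific` is a solution. Queries that also use disjunction or negation need not have this property, which is the situation the generalized version spaces are meant to cover.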
