Systematic Approach for Optimizing Complex Mining Tasks on Multiple Databases

Many real-world applications involve not just a single dataset, but a view of multiple datasets. These datasets may be collected from different sources and/or at different points in time. In such scenarios, comparing patterns or features from different datasets and understanding their relationships can be an extremely important part of the KDD process. This paper considers the problem of optimizing a mining task over multiple datasets when the task has been expressed using a high-level interface. Specifically, we make the following contributions: 1) We present an SQL-based mechanism for querying frequent patterns across multiple datasets, and establish an algebra for these queries. 2) We develop a systematic method for enumerating query plans and present several algorithms for finding optimized query plans that reduce execution costs. 3) We evaluate our algorithms on real and synthetic datasets, and show up to an order of magnitude performance improvement.
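To make the notion of a frequent-pattern query across multiple datasets concrete, the following is a minimal Python sketch of one such query: itemsets that are frequent in one dataset but infrequent in another. It is an illustration only, not the SQL-based mechanism, the algebra, or the optimized plans of the paper. The function names (`support`, `frequent_itemsets`, `frequent_in_a_not_b`), the toy datasets, and the thresholds are all hypothetical, and the enumeration is deliberately brute force.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force search for itemsets of up to max_size items
    whose support is at least min_support (illustrative, not efficient)."""
    items = sorted(set().union(*transactions)) if transactions else []
    result = {}
    for k in range(1, max_size + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            s = support(candidate, transactions)
            if s >= min_support:
                result[candidate] = s
    return result


def frequent_in_a_not_b(a, b, min_a, max_b, max_size=2):
    """Hypothetical cross-dataset query: itemsets with support >= min_a
    in dataset a and support < max_b in dataset b."""
    return {
        itemset: (sup_a, support(itemset, b))
        for itemset, sup_a in frequent_itemsets(a, min_a, max_size).items()
        if support(itemset, b) < max_b
    }


# Toy data (assumed for illustration): two small transaction datasets.
store_a = [frozenset(t) for t in (
    ["milk", "bread"], ["milk", "bread", "eggs"], ["milk"], ["bread", "eggs"],
)]
store_b = [frozenset(t) for t in (
    ["milk"], ["eggs"], ["milk", "eggs"], ["bread"],
)]

contrast = frequent_in_a_not_b(store_a, store_b, min_a=0.5, max_b=0.25)
for itemset, (sup_a, sup_b) in sorted(contrast.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), sup_a, sup_b)
```

This naive version mines the first dataset in full and then rescans the second once per candidate. A query like this can be evaluated in more than one order, which is the kind of choice that enumerating and comparing query plans is meant to address.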