SQL based frequent pattern mining

Data mining on large relational databases has gained popularity, and its significance is well recognized. However, SQL-based data mining is known to perform worse than specialized implementations, because of the prohibitive cost of extracting knowledge and the lack of suitable declarative query language support. Frequent pattern mining is a foundation of several essential data mining tasks. These facts motivated us to develop original SQL-based approaches for mining frequent patterns. In this work, we investigate SQL-based approaches to the problem of finding frequent patterns in a transaction table. Most existing SQL-based approaches are Apriori-like. However, such methods may perform poorly because of the costly candidate-generation-and-test operation, especially when mining datasets with prolific and/or long patterns. We develop a class of efficient SQL-based pattern growth methods for mining frequent patterns. What these approaches have in common is that they use a divide-and-conquer method to decompose the mining task and then use pattern growth to avoid the combinatorial problem inherent in the candidate-generation-and-test approach. Apriori algorithms implemented in SQL require either several scans over the data or many complex joins between the input tables, whereas our SQL-based algorithms avoid both multiple passes over the large original input table and complex joins between tables. A comprehensive performance study on a DBMS (IBM DB2 UDB EEE V8) compares SQL-based frequent pattern mining approaches based on Apriori with the approaches in this thesis. The empirical results show that our algorithms perform efficiently. Moreover, recently
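
The divide-and-conquer, pattern growth idea described in the abstract can be illustrated with a small sketch. This is not the thesis's actual algorithm or table layout: it assumes a hypothetical transaction table of (tid, item) rows, uses Python's standard sqlite3 module rather than DB2, and the names frequent_itemsets, grow, and the t0, t1, ... table names are invented for illustration. Each step counts the locally frequent items with one GROUP BY query, then builds a projected table for each frequent item and recurses on it, so no candidate itemsets are generated or tested.

```python
import sqlite3


def frequent_itemsets(rows, min_support):
    """Return {itemset tuple: support} for (tid, item) rows.

    Illustrative pattern growth driven by SQL; an assumption-laden sketch,
    not the method evaluated in the thesis.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t0 (tid INTEGER, item TEXT)")
    conn.executemany("INSERT INTO t0 VALUES (?, ?)", rows)
    results = {}
    table_counter = 0

    def grow(table, prefix):
        nonlocal table_counter
        # Count the items that are frequent within the current (projected) table.
        frequent = conn.execute(
            f"SELECT item, COUNT(DISTINCT tid) FROM {table} "
            "GROUP BY item HAVING COUNT(DISTINCT tid) >= ? ORDER BY item",
            (min_support,),
        ).fetchall()
        for item, support in frequent:
            pattern = prefix + (item,)
            results[pattern] = support
            # Projection: keep only transactions that contain `item`, and only
            # the items ordered after it, so each itemset is reached once.
            table_counter += 1
            projected = f"t{table_counter}"
            conn.execute(f"CREATE TEMP TABLE {projected} (tid INTEGER, item TEXT)")
            conn.execute(
                f"INSERT INTO {projected} "
                f"SELECT a.tid, a.item FROM {table} a "
                f"JOIN {table} b ON a.tid = b.tid "
                "WHERE b.item = ? AND a.item > ?",
                (item, item),
            )
            grow(projected, pattern)
            conn.execute(f"DROP TABLE {projected}")

    grow("t0", ())
    conn.close()
    return results


if __name__ == "__main__":
    demo_rows = [
        (1, "a"), (1, "b"), (1, "c"),
        (2, "a"), (2, "b"),
        (3, "a"), (3, "c"),
        (4, "b"), (4, "c"),
    ]
    for itemset, support in sorted(frequent_itemsets(demo_rows, 2).items()):
        print(itemset, support)
```

The sketch trades simplicity for efficiency: it materializes one temporary table per frequent item and re-reads the parent table for each projection, so it should not be read as reproducing the single-pass behavior claimed for the thesis's algorithms.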
