A sampling-based framework for parallel data mining

The goal of data mining algorithm is to discover useful information embedded in large databases. Frequent itemset mining and sequential pattern mining are two important data mining problems with broad applications. Perhaps the most efficient way to solve these problems sequentially is to apply a pattern-growth algorithm, which is a divide-and-conquer algorithm [9, 10]. In this paper, we present a framework for parallel mining frequent itemsets and sequential patterns based on the divide-and-conquer strategy of pattern growth. Then, we discuss the load balancing problem and introduce a sampling technique, called selective sampling, to address this problem. We implemented parallel versions of both frequent itemsets and sequential pattern mining algorithms following our framework. The experimental results show that our parallel algorithms usually achieve excellent speedups.

[1]  Message Passing Interface Forum MPI: A message - passing interface standard , 1994 .

[2]  Martin Ester,et al.  Frequent term-based text clustering , 2002, KDD.

[3]  Ashfaq Khokhar,et al.  Frequent Pattern Mining on Message Passing Multiprocessor Systems , 2004, Distributed and Parallel Databases.

[4]  Jeffrey Xu Yu,et al.  Scalable sequential pattern mining for biological sequences , 2004, CIKM '04.

[5]  Vipin Kumar,et al.  Scalable parallel data mining for association rules , 1997, SIGMOD '97.

[6]  Ran Wolff,et al.  Communication-Efficient Distributed Mining of Association Rules , 2001, SIGMOD '01.

[7]  Philip S. Yu,et al.  Efficient parallel data mining for association rules , 1995, CIKM '95.

[8]  Ramakrishnan Srikant,et al.  Mining sequential patterns , 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[9]  Masaru Kitsuregawa,et al.  Tree Structure Based Parallel Frequent Pattern Mining on PC Cluster , 2003, DEXA.

[10]  Message P Forum,et al.  MPI: A Message-Passing Interface Standard , 1994 .

[11]  Kyuseok Shim,et al.  SPIRIT: Sequential Pattern Mining with Regular Expression Constraints , 1999, VLDB.

[12]  Bart Goethals,et al.  Advances in frequent itemset mining implementations: report on FIMI'03 , 2004, SKDD.

[13]  Ran Wolff,et al.  A high-performance distributed algorithm for mining association rules , 2003, Third IEEE International Conference on Data Mining.

[14]  Srinivasan Parthasarathy,et al.  Parallel Algorithms for Discovery of Association Rules , 1997, Data Mining and Knowledge Discovery.

[15]  Gösta Grahne,et al.  Efficiently Using Prefix-trees in Mining Frequent Itemsets , 2003, FIMI.

[16]  Masaru Kitsuregawa,et al.  Mining Algorithms for Sequential Patterns in Parallel: Hash Based Approach , 1998, PAKDD.

[17]  Mohammed J. Zaki Parallel Sequence Mining on Shared-Memory Machines , 1999, J. Parallel Distributed Comput..

[18]  Petra Perner,et al.  Data Mining - Concepts and Techniques , 2002, Künstliche Intell..

[19]  Ming-Syan Chen,et al.  Mining Sequential Alarm Patterns in a Telecommunication Database , 2001, Databases in Telecommunications.

[20]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[21]  Osmar R. Zaïane,et al.  Fast parallel association rule mining without candidacy generation , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[22]  Jianyong Wang,et al.  Mining sequential patterns by pattern-growth: the PrefixSpan approach , 2004, IEEE Transactions on Knowledge and Data Engineering.

[23]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD '00.

[24]  Ran Wolff,et al.  A high-performance distributed algorithm for mining association rules , 2004, Knowledge and Information Systems.

[25]  Rakesh Agrawal,et al.  Parallel Mining of Association Rules , 1996, IEEE Trans. Knowl. Data Eng..

[26]  Wynne Hsu,et al.  Integrating Classification and Association Rule Mining , 1998, KDD.

[27]  Rakesh Agarwal,et al.  Fast Algorithms for Mining Association Rules , 1994, VLDB 1994.

[28]  Jian Pei,et al.  Mining frequent patterns by pattern-growth: methodology and implications , 2000, SKDD.

[29]  Valerie Guralnik,et al.  Parallel tree-projection-based sequence mining algorithms , 2004, Parallel Comput..

[30]  Jiawei Han,et al.  A fast distributed algorithm for mining association rules , 1996, Fourth International Conference on Parallel and Distributed Information Systems.

[31]  Rajeev Motwani,et al.  Beyond market baskets: generalizing association rules to correlations , 1997, SIGMOD '97.