Parallel mining of maximal sequential patterns using multiple samples

In this paper, we propose a new parallel algorithm, named PMSPX, which mines maximal frequent sequences by using multiple samples to exclude infrequent candidates effectively. A frequent sequence is maximal if none of its supersequences is frequent. Unlike the traditional single-sample methods developed for mining frequent itemsets, PMSPX uses multiple samples. Thus, it can avoid or alleviate some problems inherent in the single-sample methods. We theoretically analyzed how to increase the minimum support level to prevent misestimating infrequent candidates as frequent in the mining of samples. PMSPX is a parallel version of our sequential MSPX algorithm, and it is developed on a cluster of workstations. In PMSPX, each processing node uses MSPX to find a candidate set of local maximal frequent sequences first, independently from other processing nodes. Then, a top-down search is performed, starting with all the candidates, in a synchronous manner to identify real maximal frequent sequences. This asynchronous local mining followed by synchronous global mining approach minimizes the synchronization and communication among the processing nodes. Three database partitioning methods are proposed to distribute the database across the processing nodes, so that their workloads are balanced and the data skewness of the whole database is preserved in the data partition of each node. A comprehensive analysis was performed on PMSPX and existing parallel sequence mining algorithms, and extensive experiments were conducted on PMSPX. PMSPX demonstrates very good speedup and scaleup properties. It also requires less communication and synchronization than other parallel algorithms.

[1]  Ramakrishnan Srikant,et al.  Mining sequential patterns , 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[2]  Umeshwar Dayal,et al.  PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth , 2001, ICDE 2001.

[3]  Osamu Watanabe,et al.  Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms , 1999, Discovery Science.

[4]  Bin Chen,et al.  A new two-phase sampling based algorithm for discovering association rules , 2002, KDD.

[5]  Mohammed J. Zaki,et al.  SPADE: An Efficient Algorithm for Mining Frequent Sequences , 2004, Machine Learning.

[6]  Johannes Gehrke,et al.  Sequential PAttern mining using a bitmap representation , 2002, KDD.

[7]  Osamu Watanabe,et al.  On-line sampling methods for discovering association rules , 1999 .

[8]  Ramakrishnan Srikant,et al.  Mining Sequential Patterns: Generalizations and Performance Improvements , 1996, EDBT.

[9]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[10]  Mohammed J. Zaki Parallel Sequence Mining on Shared-Memory Machines , 1999, Large-Scale Parallel Data Mining.

[11]  Srinivasan Parthasarathy,et al.  Evaluation of sampling for data mining of association rules , 1997, Proceedings Seventh International Workshop on Research Issues in Data Engineering. High Performance Database Management for Large-Scale Applications.

[12]  Jiawei Han,et al.  BIDE: efficient mining of frequent closed sequences , 2004, Proceedings. 20th International Conference on Data Engineering.

[13]  Osamu Watanabe,et al.  Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms , 1999, Data Mining and Knowledge Discovery.

[14]  Soon Myoung Chung,et al.  Efficient Mining of Maximal Sequential Patterns Using Multiple Samples , 2005, SDM.

[15]  Soon Myoung Chung,et al.  A Parallel Distributive Join Algorithm for Cube-Connected Multiprocessors , 1996, IEEE Trans. Parallel Distributed Syst..

[16]  Roberto J. Bayardo,et al.  Efficiently mining long patterns from databases , 1998, SIGMOD '98.

[17]  Ramakrishnan Srikant,et al.  Fast algorithms for mining association rules , 1998, VLDB 1998.

[18]  Hannu Toivonen,et al.  Sampling Large Databases for Association Rules , 1996, VLDB.

[19]  Soon Myoung Chung,et al.  A scalable algorithm for mining maximal frequent sequences using a sample , 2008, Knowledge and Information Systems.

[20]  Xifeng Yan,et al.  CloSpan: Mining Closed Sequential Patterns in Large Datasets , 2003, SDM.

[21]  Jack Dongarra,et al.  MPI: The Complete Reference , 1996 .

[22]  Philip S. Yu,et al.  Mining long sequential patterns in a noisy environment , 2002, SIGMOD '02.

[23]  Arbee L. P. Chen,et al.  An efficient algorithm for mining frequent sequences by a new strategy without support counting , 2004, Proceedings. 20th International Conference on Data Engineering.

[24]  Masaru Kitsuregawa,et al.  Mining Algorithms for Sequential Patterns in Parallel: Hash Based Approach , 1998, PAKDD.

[25]  Qiming Chen,et al.  PrefixSpan,: mining sequential patterns efficiently by prefix-projected pattern growth , 2001, Proceedings 17th International Conference on Data Engineering.

[26]  Valerie Guralnik,et al.  Parallel Tree Projection Algorithm for Sequence Mining , 2001, Euro-Par.