Parallel tree-projection-based sequence mining algorithms

Discovery of sequential patterns is becoming increasingly useful and essential in many scientific and commercial domains. Enormous sizes of available datasets and possibly large number of mined patterns demand efficient, scalable, and parallel algorithms. Even though a number of algorithms have been developed to efficiently parallelize frequent pattern discovery algorithms that are based on the candidate-generation-and-counting framework, the problem of parallelizing the more efficient projection-based algorithms has received relatively little attention and existing parallel formulations have been targeted only toward shared-memory architectures. The irregular and unstructured nature of the task-graph generated by these algorithms and the fact that these tasks operate on overlapping sub-databases makes it challenging to efficiently parallelize these algorithms on scalable distributed-memory parallel computing architectures. In this paper we present and study a variety of distributed-memory parallel algorithms for a tree-projection-based frequent sequence discovery algorithm that are able to minimize the various overheads associated with load imbalance, database overlap, and interprocessor communication. Our experimental evaluation on a 32 processor IBM SP show that these algorithms are capable of achieving good speedups, substantially reducing the amount of the required work to find sequential patterns in large databases.

[1]  Vipin Kumar,et al.  Multilevel Algorithms for Multi-Constraint Graph Partitioning , 1998, Proceedings of the IEEE/ACM SC98 Conference.

[2]  Masaru Kitsuregawa,et al.  Mining Algorithms for Sequential Patterns in Parallel: Hash Based Approach , 1998, PAKDD.

[3]  Andreas Mueller,et al.  Fast sequential and parallel algorithms for association rule mining: a comparison , 1995 .

[4]  A. J. M. van Gasteren,et al.  Derivation of a Termination Detection Algorithm for Distributed Computations , 1983, Inf. Process. Lett..

[5]  Mohammed J. Zaki Parallel Sequence Mining on Shared-Memory Machines , 1999, J. Parallel Distributed Comput..

[6]  Qiming Chen,et al.  PrefixSpan,: mining sequential patterns efficiently by prefix-projected pattern growth , 2001, Proceedings 17th International Conference on Data Engineering.

[7]  Charu C. Aggarwal,et al.  A Tree Projection Algorithm for Generation of Frequent Item Sets , 2001, J. Parallel Distributed Comput..

[8]  Vipin Kumar,et al.  Scalable parallel data mining for association rules , 1997, SIGMOD '97.

[9]  George Karypis,et al.  Introduction to Parallel Computing , 1994 .

[10]  Philip S. Yu,et al.  Efficient parallel data mining for association rules , 1995, CIKM '95.

[11]  Mohammed J. Zaki,et al.  SPADE: An Efficient Algorithm for Mining Frequent Sequences , 2004, Machine Learning.

[12]  Mohammed J. Zaki Scalable Algorithms for Association Mining , 2000, IEEE Trans. Knowl. Data Eng..

[13]  Shamkant B. Navathe,et al.  An Efficient Algorithm for Mining Association Rules in Large Databases , 1995, VLDB.

[14]  Heikki Mannila,et al.  Discovering Frequent Episodes in Sequences , 1995, KDD.

[15]  Mohammed J. Zaki Efficient enumeration of frequent sequences , 1998, CIKM '98.

[16]  Vijay P. Kumar,et al.  Analyzing Scalability of Parallel Algorithms and Architectures , 1994, J. Parallel Distributed Comput..

[17]  George Karypis,et al.  SLPMiner: an algorithm for finding frequent sequential patterns using length-decreasing support constraint , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[18]  Umeshwar Dayal,et al.  PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth , 2001, ICDE 2001.

[19]  Masaru Kitsuregawa,et al.  Hash based parallel algorithms for mining association rules , 1996, Fourth International Conference on Parallel and Distributed Information Systems.

[20]  Srinivasan Parthasarathy,et al.  Parallel Data Mining for Association Rules on Shared-memory Systems , 1998 .

[21]  Message P Forum,et al.  MPI: A Message-Passing Interface Standard , 1994 .

[22]  Osmar R. Zaïane,et al.  Fast parallel association rule mining without candidacy generation , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[23]  Rajeev Motwani,et al.  Beyond market baskets: generalizing association rules to correlations , 1997, SIGMOD '97.

[24]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD '00.

[25]  Vipin Kumar,et al.  Unstructured Tree Search on SIMD Parallel Computers , 1994, IEEE Trans. Parallel Distributed Syst..

[26]  Ramakrishnan Srikant,et al.  Mining Sequential Patterns: Generalizations and Performance Improvements , 1996, EDBT.

[27]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[28]  Philip S. Yu,et al.  An effective hash-based algorithm for mining association rules , 1995, SIGMOD '95.

[29]  Rakesh Agrawal,et al.  Parallel Mining of Association Rules , 1996, IEEE Trans. Knowl. Data Eng..

[30]  George Karypis,et al.  LPMiner: an algorithm for finding frequent itemsets using length-decreasing support constraint , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[31]  Mohammed J. Zaki Parallel and distributed association mining: a survey , 1999, IEEE Concurr..

[32]  Rakesh Agarwal,et al.  Fast Algorithms for Mining Association Rules , 1994, VLDB 1994.