Probabilistic Static Load-Balancing of Parallel Mining of Frequent Sequences

Frequent sequence mining is a well-known and well-studied problem in data mining. Its output is used in many other areas, such as bioinformatics, chemistry, and market basket analysis. Unfortunately, frequent sequence mining is computationally expensive. In this paper, we present a novel parallel algorithm for mining frequent sequences based on static load balancing. The static load balancing is done by measuring the computational time using a probabilistic algorithm. For instances of reasonable size, the algorithm achieves speedups of up to $\approx 3/4 \cdot P$, where $P$ is the number of processors. In the experimental evaluation, we show that our method performs significantly better than the current state-of-the-art methods. The presented approach is general: it can also be used for static load balancing of other pattern mining algorithms, such as itemset, tree, and graph mining algorithms.
