PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce

Frequent Itemsets and Association Rules Mining (FIM) is a key task in knowledge discovery from data. As the dataset grows, the cost of solving this task is dominated by the component that depends on the number of transactions in the dataset. We address this issue by proposing PARMA, a parallel algorithm for the MapReduce framework that scales well with the size of the dataset (measured as the number of transactions) while minimizing data replication and communication cost. PARMA cuts down the dataset-size-dependent part of the cost by using a random sampling approach to FIM. Each machine mines a small random sample of the dataset, and the results from all machines are then filtered and aggregated to produce a single output collection. The output is a very close approximation of the collection of Frequent Itemsets (FIs) or Association Rules (ARs), together with their frequencies and confidence levels. Our analysis probabilistically guarantees that the quality of the output is within the user-specified accuracy and error probability parameters. Both the size of each random sample and the number of samples are independent of the size of the dataset; they depend only on the user-chosen accuracy and error probability parameters and on the parallel computational model. We implemented PARMA in Hadoop MapReduce and show experimentally that it runs faster than previously introduced FIM algorithms for the same platform, while 1) scaling almost linearly, and 2) offering even higher accuracy and confidence than what is guaranteed by the analysis.
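The sample, mine, and aggregate pattern described above can be illustrated with a minimal single-process sketch in Python. It is not the authors' Hadoop implementation. The abstract does not specify the filtering and aggregation rule or the sample-size formula, so the sketch makes several assumptions: an itemset is kept only if at least `min_votes` samples report it, its reported frequency is the median of the per-sample frequencies, and `num_samples` and `sample_size` are chosen by hand rather than derived from accuracy and error probability parameters. The brute-force miner and all function names are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations
from statistics import median


def mine_frequent_itemsets(transactions, min_freq, max_len=2):
    """Brute-force miner: frequencies of all itemsets with up to max_len items."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for k in range(1, max_len + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_freq}


def sample_mine_aggregate(dataset, num_samples, sample_size,
                          min_freq, min_votes, seed=0):
    """Mine several random samples and combine the local results.

    Assumed aggregation rule (not taken from the abstract): keep an
    itemset only if at least min_votes samples report it, and report
    the median of its per-sample frequencies.
    """
    rng = random.Random(seed)

    # "Map" side: each simulated machine mines one sample whose size
    # does not depend on len(dataset). Sampling is with replacement.
    local_results = []
    for _ in range(num_samples):
        sample = rng.choices(dataset, k=sample_size)
        local_results.append(mine_frequent_itemsets(sample, min_freq))

    # "Reduce" side: group the per-sample frequencies by itemset.
    reported = {}
    for result in local_results:
        for itemset, freq in result.items():
            reported.setdefault(itemset, []).append(freq)

    return {s: median(fs) for s, fs in reported.items()
            if len(fs) >= min_votes}


if __name__ == "__main__":
    # Toy dataset: "a" occurs in every transaction, "b" in half of them.
    toy = [("a", "b")] * 50 + [("a",)] * 50
    approx = sample_mine_aggregate(toy, num_samples=5, sample_size=200,
                                   min_freq=0.2, min_votes=3)
    for itemset, freq in sorted(approx.items()):
        print(itemset, round(freq, 3))
```

In this sketch the work per sample depends only on `sample_size`, so growing the dataset does not increase the mining cost, which mirrors the scaling argument made in the abstract.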
