Frequent Itemset Mining for Big Data

Frequent Itemset Mining (FIM) is one of the best-known techniques for extracting knowledge from data. The combinatorial explosion inherent in FIM methods becomes even more problematic when they are applied to Big Data. Fortunately, recent improvements in the field of parallel programming provide good tools to tackle this problem. However, these tools come with their own technical challenges, e.g., balanced data distribution and inter-communication costs. In this paper, we investigate the applicability of FIM techniques on the MapReduce platform. We introduce two new methods for mining large datasets: Dist-Eclat focuses on speed, while BigFIM is optimized to run on very large datasets. Our experiments show the scalability of both methods.
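
To make the mining task concrete, the following is a minimal single-machine sketch of Eclat-style frequent itemset mining, the vertical, depth-first approach from which Dist-Eclat takes its name. It is not the distributed algorithm proposed in the paper. The function name `eclat`, its parameters, and the toy data are assumptions made for illustration only.

```python
from typing import Dict, Iterable, List, Set, Tuple

Itemset = Tuple[str, ...]


def eclat(transactions: List[Iterable[str]], min_support: int) -> Dict[Itemset, int]:
    """Return every itemset contained in at least `min_support` transactions.

    Illustrative sketch only: sequential, in-memory, no MapReduce.
    """
    # Vertical layout: map each item to the ids of the transactions containing it.
    tidlists: Dict[str, Set[int]] = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists.setdefault(item, set()).add(tid)

    frequent: Dict[Itemset, int] = {}

    def extend(prefix: Itemset, candidates: List[Tuple[str, Set[int]]]) -> None:
        # Depth-first search: grow `prefix` by one item at a time.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            # Intersect tid-lists to get the support of each longer itemset.
            longer = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_support:
                    longer.append((other, shared))
            if longer:
                extend(itemset, longer)

    # Start from the frequent single items, in a fixed (alphabetical) order.
    start = sorted(
        ((item, tids) for item, tids in tidlists.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend((), start)
    return frequent


# Hypothetical toy database of five transactions.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = eclat(toy, min_support=3)
```

With a minimum support of 3, the toy database yields the three single items and the three pairs, while {a, b, c} is pruned because it occurs in only two transactions. The number of candidate itemsets can grow exponentially with the number of items, which is the combinatorial explosion the abstract refers to.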
