YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark

Frequent itemset mining (FIM) is one of the most important techniques for extracting knowledge from data in many real-world applications. The Apriori algorithm is a widely used algorithm for mining frequent itemsets from a transactional dataset. However, the FIM process is both data-intensive and compute-intensive. On the one hand, large-scale datasets are now common in data mining; on the other hand, to generate valid information the algorithm must scan the dataset iteratively many times. Together, these make FIM very time-consuming on big data. Parallel and distributed computing is an effective and widely used strategy for speeding up algorithms on large-scale datasets. However, existing parallel Apriori algorithms implemented with the MapReduce model are not efficient enough for iterative computation. In this paper, we propose YAFIM (Yet Another Frequent Itemset Mining), a parallel Apriori algorithm based on the Spark RDD framework, a specially designed in-memory parallel computing model that supports iterative algorithms and interactive data mining. Experimental results show that, compared with algorithms implemented with MapReduce, YAFIM achieves an 18× speedup on average across various benchmarks. In particular, we apply YAFIM to a real-world medical application to explore relationships in medicine, where it outperforms the MapReduce method by a factor of about 25.
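To illustrate why Apriori scans the dataset repeatedly, the following is a minimal single-machine sketch of the classic algorithm in plain Python. It is not the YAFIM implementation and uses no Spark APIs; the function name `apriori`, the absolute-count `min_support` parameter, and the toy transactions are assumptions made for illustration only. Each iteration of the `while` loop performs one full pass over the transactions, which is the step an in-memory framework would try to keep cheap.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset seen in at least
    min_support transactions (min_support is an absolute count here,
    an assumption of this sketch)."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Counting step: one more full scan of the dataset per level k.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical toy dataset, for illustration only.
toy = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
found = apriori(toy, min_support=3)
```

With this toy input, every single item and every pair meets the threshold, while the triple `{a, b, c}` appears in only two transactions and is dropped after the third scan.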
