Paper information - An Efficient Algorithm of Frequent Itemsets Mining Based on MapReduce

Mainstream parallel algorithms for mining frequent itemsets (patterns) implement FP-Growth or Apriori on the MapReduce (MR) framework. Existing MR FP-Growth algorithms cannot distribute data evenly among nodes, while MR Apriori algorithms require multiple map/reduce rounds and generate too many key-value pairs whose value is 1; these drawbacks limit their performance. This paper proposes an algorithm, FIMMR: it first mines the local frequent itemsets of each data chunk as candidates, applies pruning strategies to the candidates, and then identifies the global frequent itemsets among them. Experimental results show that FIMMR significantly outperforms PFP and SPC in running time; under small minimum support thresholds, FIMMR achieves an order-of-magnitude improvement over the other two algorithms, and its speedup is also satisfactory.
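The local-then-global flow described in the abstract can be illustrated with a small single-process sketch. This is not the FIMMR implementation: the abstract does not specify the local miner or the pruning strategies, so the sketch substitutes a brute-force local miner and omits pruning entirely. The names `local_frequent_itemsets`, `mine_frequent_itemsets`, `min_ratio`, and `num_chunks` are hypothetical, and the minimum support is assumed to be given as a fraction of the transaction count.

```python
from collections import Counter
from itertools import combinations
from math import ceil


def local_frequent_itemsets(chunk, min_ratio):
    """Brute-force miner for one chunk (assumption: stands in for any local miner)."""
    threshold = ceil(min_ratio * len(chunk))
    counts = Counter()
    for transaction in chunk:
        items = sorted(set(transaction))
        # Enumerate every non-empty subset; only viable for short transactions.
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset for itemset, n in counts.items() if n >= threshold}


def mine_frequent_itemsets(transactions, min_ratio, num_chunks):
    """Two-phase sketch: local candidates per chunk, then one global count."""
    if not transactions:
        return {}
    chunk_size = ceil(len(transactions) / num_chunks)
    chunks = [transactions[i:i + chunk_size]
              for i in range(0, len(transactions), chunk_size)]

    # Phase 1: each chunk contributes its locally frequent itemsets as candidates.
    # An itemset that is globally frequent must be locally frequent in at least
    # one chunk, so the union of local results cannot miss a global answer.
    candidates = set()
    for chunk in chunks:
        candidates |= local_frequent_itemsets(chunk, min_ratio)

    # (The pruning step named in the abstract would go here; it is omitted
    # because the abstract does not describe it.)

    # Phase 2: count every candidate over the full data set and keep the
    # ones that reach the global threshold.
    global_counts = Counter()
    for transaction in transactions:
        items = set(transaction)
        for candidate in candidates:
            if items.issuperset(candidate):
                global_counts[candidate] += 1
    threshold = ceil(min_ratio * len(transactions))
    return {c: n for c, n in global_counts.items() if n >= threshold}
```

In a MapReduce setting, phase 1 would presumably run once per data chunk in parallel and phase 2 would be a second counting pass; here both run sequentially to keep the example self-contained.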

Jing Zhang | Le Wang | Lin Feng | Pengyu Liao

[1] Edward Y. Chang, et al. PFP: parallel FP-growth for query recommendation, 2008, RecSys '08.

[2] Ramakrishnan Srikant, et al. Fast Algorithms for Mining Association Rules in Large Databases, 1994, VLDB.

[3] Ming-Yen Lin, et al. Apriori-based frequent itemset mining algorithms on MapReduce, 2012, ICUIMC.

[4] Eli Upfal, et al. PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce, 2012, CIKM.

[5] Jian Pei, et al. H-Mine: Fast and space-preserving frequent pattern mining in large databases, 2007.

[6] Zhen Liu, et al. MapReduce as a programming model for association rules algorithm on Hadoop, 2010, The 3rd International Conference on Information Sciences and Interaction Sciences.

[7] Roger Champagne, et al. Adaptation of Apriori to MapReduce to Build a Warehouse of Relations between Named Entities across the Web, 2010, 2010 Second International Conference on Advances in Databases, Knowledge, and Data Applications.

[8] Lin Feng, et al. Sliding Window-based Frequent Itemsets Mining over Data Streams using Tail Pointer Table, 2014, Int. J. Comput. Intell. Syst.

[9] Zvi M. Kedem, et al. Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set, 1998, EDBT.

[10] Jin Chang, et al. Balanced parallel FP-Growth with MapReduce, 2010, 2010 IEEE Youth Conference on Information, Computing and Telecommunications.

[11] Lin Feng, et al. UT-Tree: Efficient mining of high utility itemsets from data streams, 2013, Intell. Data Anal.

[12] Chunfeng Yuan, et al. PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets, 2011, 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming.

[13] Johannes Gehrke, et al. MAFIA: a maximal frequent itemset algorithm, 2005, IEEE Transactions on Knowledge and Data Engineering.

[14] Jian Pei, et al. Mining frequent patterns without candidate generation, 2000, SIGMOD '00.

[15] Tzung-Pei Hong, et al. DBV-Miner: A Dynamic Bit-Vector approach for fast mining frequent closed itemsets, 2012, Expert Syst. Appl.

[16] Young-Koo Lee, et al. Efficient Tree Structures for High Utility Pattern Mining in Incremental Databases, 2009, IEEE Transactions on Knowledge and Data Engineering.