MapReduce-based efficient algorithm for finding large patterns

Finding large patterns is an objective of computational intelligence and a key step in many data mining applications, particularly Big Data applications, where the scalability of mining algorithms is a major concern. This paper proposes Pampas, an efficient algorithm that takes full advantage of the MapReduce framework to address the scalability issue. The novelty lies in two aspects. First, Pampas is the first parallel algorithm that integrates a breadth-first search strategy with a vertical mining approach. Second, Pampas combines different vertical formats to represent the data, which improves both scalability and efficiency. Extensive experimental results demonstrate that the proposed algorithm outperforms existing algorithms and scales out well with respect to database size and cluster size.
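To make the two ideas in the abstract concrete, the following is a minimal single-machine sketch of breadth-first (level-wise) mining over a vertical data layout. It is not the Pampas algorithm itself and contains no MapReduce code. The abstract does not say which vertical formats Pampas combines, so the use of tidsets (the set of transaction IDs containing an itemset) and diffsets (the IDs lost when an itemset is extended) is an assumption made for illustration. All function names are hypothetical.

```python
from itertools import combinations


def to_vertical(transactions):
    """Convert horizontal transactions into a vertical layout: item -> tidset."""
    vertical = {}
    for tid, items in enumerate(transactions):
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical


def mine_bfs_vertical(transactions, min_support):
    """Level-wise (breadth-first) frequent itemset mining using tidsets.

    Returns a dict mapping each frequent itemset (frozenset) to its support.
    """
    vertical = to_vertical(transactions)
    # Level 1: frequent single items and their tidsets.
    level = {
        frozenset([item]): tids
        for item, tids in vertical.items()
        if len(tids) >= min_support
    }
    frequent = {}
    k = 1
    while level:
        frequent.update({itemset: len(tids) for itemset, tids in level.items()})
        next_level = {}
        # Join pairs of frequent k-itemsets into (k+1)-itemset candidates.
        for (a, tids_a), (b, tids_b) in combinations(level.items(), 2):
            candidate = a | b
            if len(candidate) != k + 1 or candidate in next_level:
                continue
            # Support counting is a tidset intersection, with no database rescan.
            tids = tids_a & tids_b
            if len(tids) >= min_support:
                next_level[candidate] = tids
        level = next_level
        k += 1
    return frequent


def support_via_diffset(tids_px, tids_py):
    """Support of PXY from the diffset d(PXY) = t(PX) - t(PY).

    Illustrative alternative vertical format (an assumption, see lead-in).
    """
    diffset = tids_px - tids_py
    return len(tids_px) - len(diffset)


if __name__ == "__main__":
    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    for itemset, support in sorted(
        mine_bfs_vertical(db, 3).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))
    ):
        print(sorted(itemset), support)
```

Because each level depends only on the tidsets of the previous level, the candidate joins within a level are independent of one another, which is the property a parallel framework could exploit. The diffset helper shows why mixing formats may pay off: when most transactions containing PX also contain PY, the diffset is much smaller than the tidset it replaces.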
