A fast and low idle time method for mining frequent patterns in distributed and many-task computing environments

Association rule mining has attracted much attention in data mining because it has been applied successfully in various fields to find associations between items, such as purchased products, by identifying frequent patterns (FPs). Databases are now huge, ranging in size from terabytes to petabytes. Although existing methods discover FPs effectively and deduce association rules from them, execution efficiency remains a critical problem, particularly for big data. Progressive size working set (PSWS) and parallel FP-growth (PFP) are state-of-the-art methods that apply parallel and distributed computing to reduce mining time in many-task computing, which bridges the gap between high-throughput and high-performance computing. However, these methods cannot begin mining until a computing node has obtained a complete FP-tree or the corresponding subdatabase, which leaves computing nodes idle for long periods. We propose a method that can begin mining as soon as a small part of an FP-tree is received. This reduces the idle time of computing nodes and, in turn, the overall mining time. An empirical evaluation shows that the proposed method is faster than PSWS and PFP.
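
To make the terms concrete, the following minimal sketch shows what identifying frequent patterns and deducing an association rule can look like on a toy transaction database. It is a brute-force illustration only and is not the proposed method, PSWS, or PFP; the function names `frequent_patterns` and `confidence`, the sample transactions, and the support threshold are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def frequent_patterns(transactions, min_support):
    """Count every itemset and keep those found in at least min_support transactions.

    Brute-force enumeration, exponential in transaction length; for illustration only.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # ignore duplicate items within a transaction
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


def confidence(patterns, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent, from frequent-pattern counts."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return patterns[both] / patterns[a]


# Hypothetical toy database of purchase transactions.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk"],
]

fps = frequent_patterns(transactions, min_support=2)
rule_conf = confidence(fps, ["butter"], ["bread"])
```

Here `fps` holds the itemsets that occur in at least two transactions, and `rule_conf` is the fraction of transactions containing butter that also contain bread. Tree-based methods such as FP-growth avoid this exhaustive enumeration by compressing the database into an FP-tree before mining.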
