A fast and resource efficient mining algorithm for discovering frequent patterns in distributed computing environments

The advancement of electronic technology enables us to collect logs from various devices. Such logs require detailed analysis in order to be broadly useful. Data mining is a technique that has been widely used to extract hidden information from such data. It mainly comprises association rule mining, sequential pattern mining, classification, and clustering. Association rule mining has attracted significant attention and has been successfully applied in various fields. Although past studies can effectively discover the frequent patterns from which association rules are deduced, execution efficiency remains a critical problem. To speed up execution, many methods using parallel and distributed computing technology have been proposed in recent years. Most past studies focused on parallelizing the workload on a high-end machine or in distributed computing environments such as grid or cloud computing systems; however, very few of them discuss how to efficiently determine the appropriate number of computing nodes with respect to execution efficiency and load balancing. An intuition is that execution speed is proportional to the number of computing nodes; that is, the more computing nodes, the faster the execution. However, this is incorrect for such algorithms because of their inherent algorithmic design. Allocating too many computing nodes can lead to high execution time. Beyond this inefficiency, inappropriate resource allocation wastes computing power and network bandwidth. At the same time, the load cannot be effectively distributed if too few nodes are allocated. In this paper, we propose a fast, load-balancing, and resource-efficient algorithm named FLR-Mining for discovering frequent patterns in distributed computing systems. FLR-Mining is capable of determining the appropriate number of computing nodes automatically and achieves better load balancing than existing methods.
Through empirical evaluation, FLR-Mining is shown to deliver excellent performance in terms of execution efficiency and load balancing. The algorithm is parameter-less and able to determine the appropriate number of computing nodes automatically. The algorithm requires only 54.3% of the execution time of PSWS when using 20 nodes. The algorithm achieves better load balancing than existing methods. The nodes do not need to exchange any transactions or sub-databases with each other.
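
The claim that adding nodes does not always reduce execution time can be illustrated with a toy cost model. The sketch below is not FLR-Mining and is not taken from the paper; it assumes, purely for illustration, that total time is a divisible workload term plus a per-node coordination overhead. The function names and the constants `work` and `overhead_per_node` are hypothetical.

```python
def estimated_time(nodes, work=900.0, overhead_per_node=1.0):
    """Toy model (an assumption, not the paper's): divisible work plus
    a coordination cost that grows linearly with the number of nodes."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return work / nodes + overhead_per_node * nodes


def best_node_count(max_nodes, work=900.0, overhead_per_node=1.0):
    """Return the node count in 1..max_nodes with the lowest estimated time."""
    return min(
        range(1, max_nodes + 1),
        key=lambda n: estimated_time(n, work, overhead_per_node),
    )


if __name__ == "__main__":
    for n in (10, 30, 90):
        print(n, estimated_time(n))
    print("best:", best_node_count(100))
```

Under these assumed constants, the estimated time falls as nodes are added up to a point and rises afterwards, so both too few and too many nodes are worse than an intermediate allocation. How FLR-Mining actually selects the number of nodes is described in the paper itself, not by this model.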
