Data mining in distributed environment: a survey

The rapid growth of resource sharing has driven the development of distributed systems, which pool the computing power of many machines. Data mining (DM) provides powerful techniques for finding meaningful and useful information in very large amounts of data, and has a wide range of real‐world applications. However, traditional DM algorithms assume that the data are centrally collected, memory‐resident, and static. Managing large‐scale data and processing them with very limited resources is challenging. For example, large amounts of data are produced quickly and stored at multiple locations, and it becomes increasingly expensive to centralize them in a single place. Moreover, traditional DM algorithms run into resource constraints such as limited memory, low processing capacity, and insufficient disk space. To address these problems, DM in distributed computing environments [also called distributed data mining (DDM)] has emerged as a valuable alternative in many applications. This study surveys state‐of‐the‐art DDM techniques, including distributed frequent itemset mining, distributed frequent sequence mining, distributed frequent graph mining, distributed clustering, and privacy preservation in distributed data mining. Finally, we summarize the opportunities for data mining tasks in distributed environments. WIREs Data Mining Knowl Discov 2017, 7:e1216. doi: 10.1002/widm.1216
