MapReduce-based fuzzy c-means clustering algorithm: implementation and scalability

The management and analysis of big data has been identified as one of the most important emerging needs in recent years, owing to the sheer volume and increasing complexity of the data being created or collected. Current clustering algorithms cannot handle big data, so scalable solutions are necessary. Since fuzzy clustering algorithms have been shown to outperform hard clustering approaches in terms of accuracy, this paper investigates the parallelization and scalability of a common and effective fuzzy clustering algorithm, the fuzzy c-means (FCM) algorithm. The algorithm is parallelized using the MapReduce paradigm, and the paper outlines how the Map and Reduce primitives are implemented. A validity analysis is conducted to show that the implementation works correctly, achieving purity results competitive with state-of-the-art clustering algorithms. Furthermore, a scalability analysis is conducted to demonstrate the performance of the parallel FCM implementation as the number of computing nodes increases.
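As a rough illustration of how one FCM iteration could be split into Map and Reduce primitives, the sketch below simulates the two phases in a single Python process. It is not the authors' implementation: the function names (`memberships`, `fcm_map`, `fcm_reduce`, `fcm_iteration`), the fuzzifier value `m = 2`, and the in-memory grouping that stands in for the shuffle phase are all assumptions made for illustration. It uses only the standard FCM update rules: the membership of point x_i in cluster j is 1 / Σ_k (d_ij / d_ik)^(2/(m-1)), and the new centroid is the u^m-weighted mean of the points.

```python
import math
from collections import defaultdict


def memberships(point, centroids, m=2.0):
    """Standard FCM membership degrees of one point in every cluster."""
    dists = [math.dist(point, c) for c in centroids]
    # If the point coincides with a centroid, it belongs fully to that cluster.
    for j, d in enumerate(dists):
        if d == 0.0:
            return [1.0 if k == j else 0.0 for k in range(len(centroids))]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((d_j / d_k) ** exp for d_k in dists) for d_j in dists]


def fcm_map(point, centroids, m=2.0):
    """Map: emit (cluster index, (weighted point, weight)) for one data point."""
    for j, u in enumerate(memberships(point, centroids, m)):
        w = u ** m
        yield j, ([w * x for x in point], w)


def fcm_reduce(values):
    """Reduce: sum the partial numerators and denominators for one cluster."""
    dim = len(values[0][0])
    num, den = [0.0] * dim, 0.0
    for vec, w in values:
        for d in range(dim):
            num[d] += vec[d]
        den += w
    # None signals that no point carries weight for this cluster.
    return [n / den for n in num] if den > 0.0 else None


def fcm_iteration(points, centroids, m=2.0):
    """One FCM centroid update; the dict stands in for the shuffle phase."""
    groups = defaultdict(list)
    for p in points:
        for j, value in fcm_map(p, centroids, m):
            groups[j].append(value)
    updated = []
    for j, old in enumerate(centroids):
        new = fcm_reduce(groups[j])
        updated.append(new if new is not None else list(old))
    return updated


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    centers = [(1.0, 1.0), (9.0, 9.0)]
    for _ in range(10):
        centers = fcm_iteration(data, centers)
    print(centers)
```

In a real MapReduce deployment the current centroids would have to be distributed to every mapper before each iteration, and a driver would repeat the job until the centroids stop changing; both details are omitted here.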
