A Comprehensive Survey of Clustering Algorithms

Data analysis is a common method in modern scientific research, spanning communication science, computer science, and biology. Clustering, as a basic component of data analysis, plays a significant role. On the one hand, many tools for cluster analysis have been created as the volume of information has grown and disciplines have increasingly intersected. On the other hand, each clustering algorithm has its own strengths and weaknesses, owing to the complexity of information. In this review paper, we begin with the definition of clustering, consider the basic elements involved in the clustering process, such as distance or similarity measures and evaluation indicators, and analyze clustering algorithms from two perspectives: traditional and modern. All of the clustering algorithms discussed are compared in detail and summarized in Appendix Table 22.
