A Short Survey on Data Clustering Algorithms

With data volumes growing rapidly, clustering algorithms have become important tools for data analytics in modern research. They have been successfully applied in a wide range of domains, for instance bioinformatics, speech recognition, and financial analysis. Formally speaking, given a set of data instances, a clustering algorithm is expected to divide the set into subsets that maximize intra-subset similarity and inter-subset dissimilarity, where the similarity measure is defined beforehand. In this work, state-of-the-art clustering algorithms are reviewed from design concept to methodology. Different clustering paradigms are discussed, followed by advanced clustering algorithms. After that, existing clustering evaluation metrics are reviewed. A summary with future insights is provided at the end.
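
The formal objective above can be made concrete with a small sketch. It assumes Euclidean distance as the predefined (dis)similarity measure and scores a partition by two averages: the mean distance between instances in the same subset (lower means higher intra-subset similarity) and the mean distance between instances in different subsets (higher means higher inter-subset dissimilarity). The function names and the toy data are illustrative assumptions, not taken from the survey.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def _mean_pair_distance(points, labels, same_subset):
    """Mean distance over pairs that share (or do not share) a label."""
    distances = [
        dist(p, q)
        for (p, a), (q, b) in combinations(zip(points, labels), 2)
        if (a == b) == same_subset
    ]
    if not distances:
        raise ValueError("no qualifying pairs for this partition")
    return sum(distances) / len(distances)


def mean_intra_distance(points, labels):
    """Lower values indicate higher intra-subset similarity."""
    return _mean_pair_distance(points, labels, same_subset=True)


def mean_inter_distance(points, labels):
    """Higher values indicate higher inter-subset dissimilarity."""
    return _mean_pair_distance(points, labels, same_subset=False)


# Hypothetical toy data: two visually obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good_partition = [0, 0, 1, 1]  # groups nearby instances together
bad_partition = [0, 1, 0, 1]   # mixes the two groups

for name, labels in [("good", good_partition), ("bad", bad_partition)]:
    print(name, mean_intra_distance(points, labels),
          mean_inter_distance(points, labels))
```

Under these assumptions, the partition that groups nearby instances has the smaller intra-subset distance and the larger inter-subset distance, which is the behavior a clustering algorithm is expected to favor.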
