Data clustering: 50 years beyond K-means

Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. For example, a common scheme of scientific classification puts organisms into a system of ranked taxa: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is to find structure in data, so it is exploratory in nature. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. Although K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty of designing a general-purpose clustering algorithm and to the ill-posed nature of the clustering problem. We provide a brief overview of clustering, summarize well-known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large-scale data clustering.
