Data clustering: 50 years beyond K-means

The practice of classifying objects according to perceived similarities is the basis for much of science. Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. For example, a common scheme of scientific classification places organisms into taxonomic ranks (domain, kingdom, phylum, class, etc.). Cluster analysis is the formal study of algorithms and methods for grouping objects according to measured or perceived intrinsic characteristics. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes cluster analysis (unsupervised learning) from discriminant analysis (supervised learning). The objective of cluster analysis is simply to find a convenient and valid organization of the data, not to establish rules for separating future data into categories.
