K-means clustering: a half-century synthesis.

This paper synthesizes fifty years of results, methodology, and research on the K-means clustering method. It first introduces the K-means method and outlines various formulations of the minimum variance loss function, along with alternative loss functions within the same class. It then discusses methods for choosing the number of clusters and the initialization, as well as variable preprocessing and data reduction schemes. Theoretical statistical results are provided, and various extensions of K-means that use different metrics or modify the original algorithm are given, leading to a unifying treatment of K-means and some of its extensions. Finally, several future studies are outlined that could enhance the understanding of the numerous subtleties affecting the performance of the K-means method.
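
For orientation, the following is a minimal sketch of the batch form of K-means and of the minimum variance loss, taken here to be the within-cluster sum of squared Euclidean distances to the cluster means. It is a textbook-style illustration, not code from the paper; the function names `sq_dist` and `kmeans`, the random-observation initialization, the empty-cluster handling, and the toy data are all assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Batch K-means sketch; returns (centroids, labels, sse)."""
    rng = random.Random(seed)
    # Assumption: start from k observations drawn without replacement.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the partition is stable
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    # Minimum variance loss: within-cluster sum of squared distances.
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, sse


# Toy data (hypothetical): two well-separated groups of three points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels, sse = kmeans(data, k=2)
```

Because the procedure only guarantees a local minimum of the loss, the result generally depends on the starting centroids; in practice one would rerun it from several seeds and keep the partition with the smallest `sse`.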
