Penalized K-Means Algorithms for Finding the Number of Clusters

In many applications we want to find the number of clusters in a dataset. A common approach is to use a penalized k-means algorithm with an additive penalty term that is linear in the number of clusters. The number of discovered clusters depends critically on the coefficient of this penalty term, and estimating that coefficient in a principled manner remains an open problem. In this paper, we derive rigorous bounds for the coefficient of the additive penalty in k-means for ideal clusters. Although in practice clusters typically deviate from the ideal assumption, the ideal case serves as a useful guideline. Furthermore, we investigate k-means with a multiplicative penalty, which generally produces a more reliable signature for the correct number of clusters than the additive penalty when the ideal cluster assumption holds. We also empirically investigate certain types of deviations from the ideal cluster assumption. In such cases both types of penalties may suggest multiple, ambiguous solutions. We present a consensus-based approach that resolves these ambiguities by combining the results of the additive and multiplicative penalties.
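To make the penalized objective concrete, the sketch below shows one way to select the number of clusters by sweeping k and minimizing a penalized within-cluster sum of squares. This is an illustrative sketch only, not the authors' implementation: the penalty coefficient `lam`, the multiplicative form `wcss * (1 + lam * k)`, and the helper `penalized_kmeans` are assumptions introduced here for illustration, and the paper's derived bounds are not reproduced.

```python
# Illustrative sketch of penalized k-means model selection (assumed form, not the
# paper's exact method). The additive variant adds lam * k to the within-cluster
# sum of squares (WCSS); the multiplicative variant shown is one plausible form.
import numpy as np
from sklearn.cluster import KMeans


def penalized_kmeans(X, k_max=10, lam=1.0, penalty="additive"):
    """Sweep k = 1..k_max and return the k minimizing the penalized objective."""
    scores = {}
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        wcss = km.inertia_  # within-cluster sum of squared distances
        if penalty == "additive":
            scores[k] = wcss + lam * k          # additive penalty, linear in k
        else:
            scores[k] = wcss * (1.0 + lam * k)  # a multiplicative variant (assumed)
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Example usage on synthetic data with three well-separated clusters.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(100, 2))
    for c in ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])
])
best_k, _ = penalized_kmeans(X, k_max=8, lam=5.0)
print("estimated number of clusters:", best_k)
```

In this toy setting, a suitable choice of `lam` makes the penalized objective attain its minimum at the true number of clusters; as the abstract notes, how to choose that coefficient in general is precisely the question the paper addresses.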
