On the Persistence of Clustering Solutions and True Number of Clusters in a Dataset

Clustering algorithms typically produce solutions with a prespecified number of clusters. Since a priori knowledge of the true number of underlying clusters is rarely available, a metric is needed to compare clustering solutions with different numbers of clusters. This article quantifies a notion of persistence of clustering solutions that enables such comparisons. The persistence relates to the range of data-resolution scales over which a clustering solution persists; it is quantified in terms of the maximum over the two-norms of all the associated cluster-covariance matrices. We thus associate a persistence value with each element in a set of clustering solutions with different numbers of clusters. We show that, for datasets where the natural clusters are a priori known, the clustering solutions that identify the natural clusters are the most persistent; in this way, the notion can be used to identify solutions with the true number of clusters. Detailed experiments on a variety of standard and synthetic datasets demonstrate that the proposed persistence-based indicator outperforms existing approaches, such as the gap-statistic method, the $X$-means, $G$-means, $PG$-means, and dip-means algorithms, and the information-theoretic method, in accurately identifying clustering solutions with the true number of clusters. Interestingly, our method can be explained in terms of the phase-transition phenomenon in the deterministic annealing algorithm, where the number of distinct cluster centers changes (bifurcates) as an annealing parameter is varied.
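
For intuition, the sketch below illustrates one way the persistence score suggested by the abstract could be computed with off-the-shelf k-means. It leans on the deterministic-annealing result [16] that a cluster becomes unstable and splits when the annealing parameter crosses a critical value set by the largest eigenvalue (two-norm) of its covariance matrix; accordingly, the critical scale of a $k$-cluster solution is taken here as the maximum two-norm over its cluster-covariance matrices, and the persistence of that solution is scored as the log-range of scales between the emergence of the $k$-th and $(k+1)$-th clusters. The function names and the exact scoring formula are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def max_cluster_cov_norm(X, labels):
    """Maximum two-norm (largest eigenvalue, since covariances are PSD)
    over the covariance matrices of all clusters in a labeling."""
    norms = []
    for c in np.unique(labels):
        pts = X[labels == c]
        if pts.shape[0] < 2:  # degenerate cluster: no measurable spread
            norms.append(0.0)
            continue
        cov = np.atleast_2d(np.cov(pts, rowvar=False))
        norms.append(np.linalg.norm(cov, 2))  # spectral norm
    return max(norms)

def persistence_scores(X, k_max=10, seed=0):
    """Illustrative persistence score for each k in 1..k_max.

    Assumption (based on the abstract, not the paper's exact formula):
    the k-cluster solution persists from the resolution scale at which
    its k-th cluster appears down to the scale at which the (k+1)-th
    emerges, so it is scored as log(sigma_k / sigma_{k+1}), where
    sigma_k is the max cluster-covariance two-norm of the k-solution.
    """
    sigma = {}
    for k in range(1, k_max + 2):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        sigma[k] = max_cluster_cov_norm(X, labels)
    return {k: float(np.log(sigma[k] / sigma[k + 1])) for k in range(1, k_max + 1)}

# Usage: the most persistent solution indicates the number of clusters.
# scores = persistence_scores(X, k_max=10)
# k_star = max(scores, key=scores.get)
```

Under this reading, the solution that maximizes the score is the one whose clusters survive the widest range of resolution scales, which is what the paper's experiments report for the true number of clusters.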

[1] Joachim M. Buhmann, et al. Stability-Based Model Selection, 2002, NIPS.

[2] Robert Tibshirani, et al. Estimating the number of clusters in a data set via the gap statistic, 2000.

[3] M. Stephens. EDF Statistics for Goodness of Fit and Some Comparisons, 1974.

[4] Andrew W. Moore, et al. X-means: Extending K-means with Efficient Estimation of the Number of Clusters, 2000, ICML.

[5] Puneet Sharma, et al. A Scalable Approach to Combinatorial Library Design for Drug Discovery, 2008, J. Chem. Inf. Model.

[6] J. A. Hartigan, et al. A k-means clustering algorithm, 1979.

[7] Vincenzo Catania, et al. An evolutionary fuzzy c-means approach for clustering of bio-informatics databases, 2008, 2008 IEEE International Conference on Fuzzy Systems (IEEE World Congress on Computational Intelligence).

[8] Daniel T. Larose, et al. Discovering Knowledge in Data: An Introduction to Data Mining, 2005.

[9] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm (with discussion), 1977.

[10] Catherine A. Sugar, et al. Finding the number of clusters in a data set: An information theoretic approach, 2003.

[11] Inderjit S. Dhillon, et al. Kernel k-means: spectral clustering and normalized cuts, 2004, KDD.

[12] Robert Tibshirani, et al. Cluster Validation by Prediction Strength, 2005.

[13] Pasi Fränti, et al. K-means properties on six clustering benchmark datasets, 2018, Applied Intelligence.

[14] Kuo-Chen Hung, et al. Intuitionistic fuzzy c-means clustering algorithm with neighborhood attraction in segmenting medical image, 2015, Soft Comput.

[15] Max Welling, et al. Bayesian k-Means as a Maximization-Expectation Algorithm, 2009, Neural Computation.

[16] K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems, 1998, Proc. IEEE.

[17] Michael I. Jordan, et al. On Spectral Clustering: Analysis and an algorithm, 2001, NIPS.

[18] L. Wasserman, et al. A Reference Bayesian Test for Nested Hypotheses and its Relationship to the Schwarz Criterion, 1995.

[19] Kocsis Zoltán Tamás, et al. IEEE World Congress on Computational Intelligence, 2019, IEEE Computational Intelligence Magazine.

[20] Andrew W. Moore, et al. Repairing Faulty Mixture Models using Density Estimation, 2001, ICML.

[21] Xiaogang Wang, et al. A roadmap of clustering algorithms: finding a match for a biomedical application, 2008, Briefings Bioinform.

[22] Hirotugu Akaike, et al. Akaike's Information Criterion, 2011, International Encyclopedia of Statistical Science.

[23] Fei Yuan, et al. Data Density Correlation Degree Clustering Method for Data Aggregation in WSN, 2014, IEEE Sensors Journal.

[24] Greg Hamerly, et al. PG-means: learning the number of clusters in data, 2006, NIPS.

[25] Hae-Sang Park, et al. A simple and fast algorithm for K-medoids clustering, 2009, Expert Syst. Appl.

[26] J. Rissanen, et al. Modeling by shortest data description, 1978, Autom.

[27] Eytan Domany, et al. Resampling Method for Unsupervised Estimation of Cluster Validity, 2001, Neural Computation.

[28] Argyris Kalogeratos, et al. Dip-means: an incremental clustering method for estimating the number of clusters, 2012, NIPS.

[29] J. Hartigan, et al. The Dip Test of Unimodality, 1985.

[30] Greg Hamerly, et al. Learning the k in k-means, 2003, NIPS.

[31] Allen Gersho, et al. Vector quantization and signal compression, 1991, The Kluwer international series in engineering and computer science.

[32] Pasi Fränti, et al. Iterative shrinking method for clustering problems, 2006, Pattern Recognit.
