On the Persistence of Clustering Solutions and True Number of Clusters in a Dataset

Clustering algorithms typically produce solutions with a prespecified number of clusters. Since a priori knowledge of the true number of underlying clusters is rarely available, a metric is needed to compare clustering solutions with different numbers of clusters. This article quantifies a notion of persistence of clustering solutions that enables such comparisons. The persistence relates to the range of data-resolution scales over which a clustering solution persists; it is quantified in terms of the maximum over the two-norms of all the associated cluster-covariance matrices. We thus associate a persistence value with each element in a set of clustering solutions with different numbers of clusters. We show that, for datasets where the natural clusters are a priori known, the clustering solutions that identify the natural clusters are the most persistent; in this way, the notion can be used to identify solutions with the true number of clusters. Detailed experiments on a variety of standard and synthetic datasets demonstrate that the proposed persistence-based indicator outperforms existing approaches, such as the gap-statistic method, the $X$-means, $G$-means, $PG$-means, and dip-means algorithms, and the information-theoretic method, in accurately identifying clustering solutions with the true number of clusters. Interestingly, our method can be explained in terms of the phase-transition phenomenon in the deterministic annealing algorithm, where the number of distinct cluster centers changes (bifurcates) as an annealing parameter is varied.
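
For intuition, the sketch below illustrates one way the persistence score suggested by the abstract could be computed with off-the-shelf k-means. It leans on the deterministic-annealing result [16] that a cluster becomes unstable and splits when the annealing parameter crosses a critical value set by the largest eigenvalue (two-norm) of its covariance matrix; accordingly, the critical scale of a $k$-cluster solution is taken here as the maximum two-norm over its cluster-covariance matrices, and the persistence of that solution is scored as the log-range of scales between the emergence of the $k$-th and $(k+1)$-th clusters. The function names and the exact scoring formula are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def max_cluster_cov_norm(X, labels):
    """Maximum two-norm (largest eigenvalue, since covariances are PSD)
    over the covariance matrices of all clusters in a labeling."""
    norms = []
    for c in np.unique(labels):
        pts = X[labels == c]
        if pts.shape[0] < 2:  # degenerate cluster: no measurable spread
            norms.append(0.0)
            continue
        cov = np.atleast_2d(np.cov(pts, rowvar=False))
        norms.append(np.linalg.norm(cov, 2))  # spectral norm
    return max(norms)

def persistence_scores(X, k_max=10, seed=0):
    """Illustrative persistence score for each k in 1..k_max.

    Assumption (based on the abstract, not the paper's exact formula):
    the k-cluster solution persists from the resolution scale at which
    its k-th cluster appears down to the scale at which the (k+1)-th
    emerges, so it is scored as log(sigma_k / sigma_{k+1}), where
    sigma_k is the max cluster-covariance two-norm of the k-solution.
    """
    sigma = {}
    for k in range(1, k_max + 2):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        sigma[k] = max_cluster_cov_norm(X, labels)
    return {k: float(np.log(sigma[k] / sigma[k + 1])) for k in range(1, k_max + 1)}

# Usage: the most persistent solution indicates the number of clusters.
# scores = persistence_scores(X, k_max=10)
# k_star = max(scores, key=scores.get)
```

Under this reading, the solution that maximizes the score is the one whose clusters survive the widest range of resolution scales, which is what the paper's experiments report for the true number of clusters.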

[1] Joachim M. Buhmann, et al. Stability-Based Model Selection, 2002, NIPS.

[2] Robert Tibshirani, et al. Estimating the number of clusters in a data set via the gap statistic, 2000.

[3] M. Stephens. EDF Statistics for Goodness of Fit and Some Comparisons, 1974.

[4] Andrew W. Moore, et al. X-means: Extending K-means with Efficient Estimation of the Number of Clusters, 2000, ICML.

[5] Puneet Sharma, et al. A Scalable Approach to Combinatorial Library Design for Drug Discovery, 2008, J. Chem. Inf. Model.

[6] J. A. Hartigan, et al. A k-means clustering algorithm, 1979.

[7] Vincenzo Catania, et al. An evolutionary fuzzy c-means approach for clustering of bio-informatics databases, 2008, 2008 IEEE International Conference on Fuzzy Systems (IEEE World Congress on Computational Intelligence).

[8] Daniel T. Larose, et al. Discovering Knowledge in Data: An Introduction to Data Mining, 2005.

[9] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm (with discussion), 1977.

[10] Catherine A. Sugar, et al. Finding the number of clusters in a data set: An information theoretic approach, 2003.

[11] Inderjit S. Dhillon, et al. Kernel k-means: spectral clustering and normalized cuts, 2004, KDD.

[12] Robert Tibshirani, et al. Cluster Validation by Prediction Strength, 2005.

[13] Pasi Fränti, et al. K-means properties on six clustering benchmark datasets, 2018, Applied Intelligence.

[14] Kuo-Chen Hung, et al. Intuitionistic fuzzy c-means clustering algorithm with neighborhood attraction in segmenting medical image, 2015, Soft Comput.

[15] Max Welling, et al. Bayesian k-Means as a Maximization-Expectation Algorithm, 2009, Neural Computation.

[16] K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems, 1998, Proc. IEEE.

[17] Michael I. Jordan, et al. On Spectral Clustering: Analysis and an algorithm, 2001, NIPS.

[18] L. Wasserman, et al. A Reference Bayesian Test for Nested Hypotheses and its Relationship to the Schwarz Criterion, 1995.

[19] Kocsis Zoltán Tamás, et al. IEEE World Congress on Computational Intelligence, 2019, IEEE Computational Intelligence Magazine.

[20] Andrew W. Moore, et al. Repairing Faulty Mixture Models using Density Estimation, 2001, ICML.

[21] Xiaogang Wang, et al. A roadmap of clustering algorithms: finding a match for a biomedical application, 2008, Briefings Bioinform.

[22] Hirotugu Akaike, et al. Akaike's Information Criterion, 2011, International Encyclopedia of Statistical Science.

[23] Fei Yuan, et al. Data Density Correlation Degree Clustering Method for Data Aggregation in WSN, 2014, IEEE Sensors Journal.

[24] Greg Hamerly, et al. PG-means: learning the number of clusters in data, 2006, NIPS.

[25] Hae-Sang Park, et al. A simple and fast algorithm for K-medoids clustering, 2009, Expert Syst. Appl.

[26] J. Rissanen, et al. Modeling by shortest data description, 1978, Autom.

[27] Eytan Domany, et al. Resampling Method for Unsupervised Estimation of Cluster Validity, 2001, Neural Computation.

[28] Argyris Kalogeratos, et al. Dip-means: an incremental clustering method for estimating the number of clusters, 2012, NIPS.

[29] J. Hartigan, et al. The Dip Test of Unimodality, 1985.

[30] Greg Hamerly, et al. Learning the k in k-means, 2003, NIPS.

[31] Allen Gersho, et al. Vector quantization and signal compression, 1991, The Kluwer international series in engineering and computer science.

[32] Pasi Fränti, et al. Iterative shrinking method for clustering problems, 2006, Pattern Recognit.
