Relating Clustering Stability to Properties of Cluster Boundaries

In this paper, we investigate stability-based methods for cluster model selection, in particular for selecting the number K of clusters. We consider the setting in which clustering is performed by minimizing a clustering quality function and a unique global minimizer exists. On the one hand, we show that stability can be upper bounded by certain properties of the optimal clustering, namely by the mass in a small tube around the cluster boundaries. On the other hand, we provide counterexamples which show that the reverse statement is not true in general. Finally, we give examples and arguments why, from a theoretical point of view, using clustering stability in a large sample setting can be problematic. In particular, distribution-free guarantees bounding the difference between the finite sample stability and the “true stability” cannot exist unless one makes strong assumptions on the underlying distribution.
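To make the quantity "mass in a small tube around the cluster boundaries" concrete, the following is a minimal sketch of how it could be estimated from a sample. It rests on assumptions that the abstract does not state: the clustering is taken to be a nearest-center (K-means style) partition, so that cluster boundaries are pieces of the hyperplanes bisecting pairs of centers, and the tube mass is estimated as the fraction of sample points within distance gamma of the boundary. The names `boundary_distances`, `tube_mass`, and `gamma` are hypothetical and chosen for illustration only; the paper's own definitions may differ.

```python
import numpy as np


def boundary_distances(points, centers):
    """Distance from each point to the boundary of its nearest-center cell.

    Assumption: clusters are the Voronoi cells of `centers`, so the boundary
    of a cell lies on hyperplanes bisecting pairs of centers.
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    n = len(points)
    rows = np.arange(n)

    # Squared distance from every point to every center, shape (n, k).
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    d2_own = d2[rows, nearest]

    # Distance between each point's own center and every center, shape (n, k).
    center_gaps = np.linalg.norm(
        centers[:, None, :] - centers[None, :, :], axis=2
    )
    sep = center_gaps[nearest]
    sep[rows, nearest] = 1.0  # placeholder to avoid 0/0 for the own center

    # Distance to the hyperplane bisecting own center c_i and center c_j:
    # (||x - c_j||^2 - ||x - c_i||^2) / (2 ||c_i - c_j||).
    dist = (d2 - d2_own[:, None]) / (2.0 * sep)
    dist[rows, nearest] = np.inf  # the own center defines no boundary

    # A Voronoi cell is convex, so the distance to its boundary is the
    # smallest distance to any of the bisecting hyperplanes.
    return dist.min(axis=1)


def tube_mass(points, centers, gamma):
    """Empirical fraction of points within `gamma` of a cluster boundary."""
    return float(np.mean(boundary_distances(points, centers) <= gamma))


# Illustrative usage on synthetic data with two hypothetical, fixed centers.
rng = np.random.default_rng(0)
sample = rng.normal(size=(1000, 2))
example_centers = np.array([[-1.0, 0.0], [1.0, 0.0]])
estimated_mass = tube_mass(sample, example_centers, gamma=0.1)
```

In this sketch the centers are fixed by hand rather than fitted; in practice they would come from whatever algorithm minimizes the chosen clustering quality function. A small estimated tube mass indicates that few points lie close to a boundary, which is the kind of property the upper bound in the abstract refers to.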
