Hierarchical Stability-Based Model Selection for Clustering Algorithms

We present HS-means, an algorithm that learns the number of clusters in a mixture model. Our method extends the concept of clustering stability to one of hierarchical stability. The method first chooses a model for the data based on an analysis of clustering stability; it then analyzes the stability of each component in the estimated model and chooses a stable model for that component, continuing this recursive stability analysis until all estimated components are unimodal. In doing so, the method can handle the hierarchical and symmetric data with which existing stability-based algorithms struggle. We test our algorithm on both synthetic and real-world datasets. The results show that HS-means outperforms a popular stability-based model selection algorithm, both in handling symmetric data and in finding high-quality clusterings for the task of predicting CPU performance.
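The recursive structure described above is straightforward to sketch. The following is a minimal, hedged illustration in Python, not the authors' implementation: the function names (`stability`, `looks_unimodal`, `hs_means`), the bootstrap-based stability score, and all parameter defaults are assumptions, and a crude histogram heuristic stands in for Hartigan's dip test that a faithful implementation would use for the unimodality check.

```python
# Hedged sketch of the recursive hierarchical-stability loop described in the
# abstract. Function names, parameters, the stability score, and the
# unimodality heuristic are illustrative assumptions, not the paper's method.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability(X, k, n_pairs=10, seed=None):
    """Mean agreement (adjusted Rand index) between a base k-means clustering
    and clusterings fit to bootstrap resamples, all evaluated on the full data."""
    rng = np.random.default_rng(seed)
    base = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores = []
    for _ in range(n_pairs):
        idx = rng.integers(0, len(X), len(X))          # bootstrap resample
        boot = KMeans(n_clusters=k, n_init=10).fit(X[idx])
        scores.append(adjusted_rand_score(base.predict(X), boot.predict(X)))
    return float(np.mean(scores))

def looks_unimodal(X, bins=20):
    """Crude stand-in for Hartigan's dip test: project onto the top principal
    component and count local maxima of a lightly smoothed histogram."""
    Xc = X - X.mean(axis=0)
    x = Xc @ np.linalg.svd(Xc, full_matrices=False)[2][0]
    h, _ = np.histogram(x, bins=bins)
    h = np.convolve(h, np.ones(3) / 3, mode="same")    # light smoothing
    peaks = sum(h[i] > h[i - 1] and h[i] >= h[i + 1] for i in range(1, bins - 1))
    return peaks <= 1

def hs_means(X, k_max=5, min_size=20):
    """Recursively pick the most stable k for each component and split it,
    stopping when a component looks unimodal (or is too small). Returns a
    flat list of index arrays, one per leaf cluster."""
    if len(X) < min_size or looks_unimodal(X):
        return [np.arange(len(X))]
    k = max(range(2, k_max + 1), key=lambda kk: stability(X, kk))
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    leaves = []
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        # Recurse on the component; map the returned local indices back to X.
        leaves += [idx[sub] for sub in hs_means(X[idx], k_max, min_size)]
    return leaves
```

Calling `hs_means(X)` on an (n, d) array returns index arrays for the leaf clusters. Bootstrap-resampling against a base clustering is only one of several resampling schemes used in the stability literature; the key point the sketch captures is that stability-based model selection is applied again, separately, inside each estimated component until every leaf appears unimodal.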
