Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms

Many clustering and segmentation algorithms both suffer from the limitation that the number of clusters/segments is specified by a human user. It is often impractical to expect a human with sufficient domain knowledge to be available to select the number of clusters/segments to return. We investigate techniques to determine the number of clusters or segments to return from hierarchical clustering and segmentation algorithms. We propose an efficient algorithm, the L method that finds the "knee" in a '# of clusters vs. clustering evaluation metric' graph. Using the knee is well-known, but is not a particularly well-understood method to determine the number of clusters. We explore the feasibility of this method, and attempt to determine in which situations it will and will not work. We also compare the L method to existing methods based on the accuracy of the number of clusters that are determined and efficiency. Our results show favorable performance for these criteria compared to the existing methods that were evaluated.

[1]  Daniel A. Keim,et al.  An Efficient Approach to Clustering in Large Multimedia Databases with Noise , 1998, KDD.

[2]  D. Hess,et al.  An objective analysis of the pressure-volume curve in the acute respiratory distress syndrome. , 2000, American journal of respiratory and critical care medicine.

[3]  Aidong Zhang,et al.  WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases , 1998, VLDB.

[4]  Bin Yu,et al.  Model Selection and the Principle of Minimum Description Length , 2001 .

[5]  Eamonn J. Keogh,et al.  An online algorithm for segmenting time series , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[6]  Vipin Kumar,et al.  Chameleon: Hierarchical Clustering Using Dynamic Modeling , 1999, Computer.

[7]  E G Anderson The kindest cut. , 1996, Postgraduate medicine.

[8]  Osmar R. Zaïane,et al.  A parameterless method for efficiently discovering clusters of arbitrary shape in large datasets , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[9]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[10]  Robert Tibshirani,et al.  Cluster Validation by Prediction Strength , 2005 .

[11]  Jonathan J. Oliver,et al.  The Kindest Cut: Minimum Message Length Segmentation , 1996, ALT.

[12]  Hannu Toivonen,et al.  Estimating the number of segments in time series data using permutation tests , 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings..

[13]  Adrian E. Raftery,et al.  How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis , 1998, Comput. J..

[14]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[15]  Philip Chan,et al.  Learning States and Rules for Time Series Anomaly Detection , 2004, FLAIRS.

[16]  Philip Chan,et al.  Learning States and Rules for Detecting Anomalies in Time Series , 2005, Applied Intelligence.

[17]  Masashi Sugiyama,et al.  Subspace Information Criterion for Model Selection , 2001, Neural Computation.

[18]  P. Diehl,et al.  Least-Squares Fitting , 1972 .

[19]  Jiawei Han,et al.  Efficient and Effective Clustering Methods for Spatial Data Mining , 1994, VLDB.

[20]  Eamonn J. Keogh,et al.  UCR Time Series Data Mining Archive , 1983 .

[21]  Jill P. Mesirov,et al.  Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data , 2003, Machine Learning.

[22]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[23]  Padhraic Smyth,et al.  Clustering Using Monte Carlo Cross-Validation , 1996, KDD.

[24]  Robert Tibshirani,et al.  Estimating the number of clusters in a data set via the gap statistic , 2000 .

[25]  George Karypis,et al.  C HAMELEON : A Hierarchical Clustering Algorithm Using Dynamic Modeling , 1999 .

[26]  Joachim M. Buhmann,et al.  A Resampling Approach to Cluster Validation , 2002, COMPSTAT.

[27]  Yao Wang,et al.  A robust and scalable clustering algorithm for mixed type attributes in large database environment , 2001, KDD '01.