A Decision Criterion for the Optimal Number of Clusters in Hierarchical Clustering

Clustering is widely used to partition data into groups so that the degree of association is high among members of the same group and low among members of different groups. Although many effective and efficient clustering algorithms have been developed and deployed, most of them still lack an automatic or online way to decide the optimal number of clusters. In this paper, we define clustering gain as a measure of clustering optimality; it is based on the squared error sum and is evaluated as a clustering algorithm proceeds. When the measure is applied to a hierarchical clustering algorithm, an optimal number of clusters can be found. Experimental results show that the measure produces intuitively reasonable clustering configurations in Euclidean space. Furthermore, the measure can be used to estimate the desired number of clusters for partitional clustering methods as well. The clustering gain measure is therefore a promising technique for improving the quality of a wide range of clustering methods.
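The idea of tracking a squared-error-based measure while a hierarchical algorithm merges clusters can be illustrated with a minimal sketch. The abstract does not state the formula for clustering gain, so the form used here, the sum over clusters of (n_j - 1) times the squared distance between the cluster centroid and the global centroid, is an assumption, as are the Ward-style merge rule and all function names (`clustering_gain`, `ward_cost`, `best_k_by_gain`). It is meant as an illustration of the approach, not as the implementation evaluated in the paper.

```python
from itertools import combinations


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def clustering_gain(clusters, global_centroid):
    # ASSUMED form of the gain: sum_j (n_j - 1) * ||c_j - c_0||^2.
    # Singletons contribute 0, and so does a single all-inclusive cluster.
    return sum((len(c) - 1) * sq_dist(centroid(c), global_centroid)
               for c in clusters)


def ward_cost(a, b):
    """Increase in the squared error sum caused by merging clusters a and b."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * sq_dist(centroid(a), centroid(b))


def best_k_by_gain(points):
    """Agglomerate from singletons down to one cluster and return the
    (number of clusters, gain) pair at which the assumed gain peaks."""
    clusters = [[p] for p in points]
    g0 = centroid(points)
    best_k, best_gain = len(clusters), clustering_gain(clusters, g0)
    while len(clusters) > 1:
        # Hypothetical merge rule: pick the pair with the smallest Ward cost.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        gain = clustering_gain(clusters, g0)
        if gain > best_gain:
            best_k, best_gain = len(clusters), gain
    return best_k, best_gain


# Toy data: three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 10), (5, 11), (6, 10), (6, 11)]
k, gain = best_k_by_gain(points)
print(k, gain)
```

Under this assumed form the measure is zero both when every point is its own cluster and when all points share one cluster, so it peaks at some intermediate level of the hierarchy; on the toy data above the peak falls at three clusters. The pairwise search makes each merge step quadratic in the number of clusters, which is acceptable only for small illustrative inputs.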
