A meta-learning approach for determining the number of clusters with consideration of nearest neighbors

An important and challenging problem in data clustering is determining the best number of clusters. A variety of estimation methods have been proposed over the years to address this problem. Most of these methods depend on several nontrivial assumptions about the data structure, and they may therefore fail to discover the true clusters in a dataset that does not satisfy those assumptions. We develop a new approach that starts from the simple and intuitive observation that close objects should fall within the same cluster, whereas distant ones should not. Based on this notion, we use a new measure of clustering quality called disconnectivity together with existing goodness measures, and we embed these measures in a meta-learning approach for estimating the number of clusters. A simulation experiment based on 13 representative models and an application to real-world datasets demonstrate the effectiveness of the proposed method.
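
The observation that close objects should share a cluster can be illustrated with a small nearest-neighbor score. The sketch below is not the paper's disconnectivity measure, whose definition is not given here. It is a hypothetical stand-in (the function name `knn_disagreement` and the parameter `k` are assumptions) that reports the fraction of each point's k nearest neighbors that were assigned to a different cluster.

```python
import numpy as np

def knn_disagreement(X, labels, k=5):
    """Fraction of (point, neighbor) pairs, over each point's k nearest
    neighbors, whose cluster labels differ. Hypothetical illustration only;
    not the disconnectivity measure defined in the paper."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances between all points.
    diff = X[:, None, :] - X[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    # Indices of the k nearest neighbors of every point.
    nn = np.argsort(d2, axis=1)[:, :k]
    # Compare each point's label with the labels of its neighbors.
    return float((labels[nn] != labels[:, None]).mean())

# Two well-separated synthetic blobs (assumed toy data).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(30, 2)),
    rng.normal(10.0, 0.5, size=(30, 2)),
])
good = np.repeat([0, 1], 30)  # labels that follow the blobs
bad = np.tile([0, 1], 30)     # labels that alternate within each blob

score_good = knn_disagreement(X, good)
score_bad = knn_disagreement(X, bad)
print(score_good, score_bad)
```

A labeling that follows the blobs scores lower than one that cuts across them. A score of this kind is trivially zero when all points are placed in one cluster, so on its own it cannot select the number of clusters; it would need to be combined with other quality measures.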
