Paper information - A meta-learning approach for determining the number of clusters with consideration of nearest neighbors

An important and challenging problem in data clustering is determining the best number of clusters. A variety of estimation methods have been proposed over the years to address this problem. Most of these methods depend on several nontrivial assumptions about the data structure, and they may therefore fail to discover the true clusters in a dataset that does not satisfy those assumptions. We develop a new approach that starts from the simple and intuitive observation that close objects should fall within the same cluster, whereas distant ones should not. Based on this notion, we use a new measure of clustering quality called disconnectivity, together with existing goodness measures, and we embed these measures into a meta-learning approach for estimating the number of clusters. A simulation experiment based on 13 representative models and an application to real-world datasets show the effectiveness of the proposed method.
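The intuition that close objects should share a cluster can be illustrated with a small sketch. The function below is a hypothetical proxy, not the disconnectivity measure defined in the paper: it simply reports the fraction of points whose single nearest neighbor carries a different cluster label. The name `nn_disagreement`, the toy data, and the brute-force search are all assumptions made for illustration.

```python
from math import dist


def nn_disagreement(points, labels):
    """Fraction of points whose nearest neighbor has a different label.

    Hypothetical illustration only; this is NOT the paper's
    disconnectivity definition.
    """
    n = len(points)
    if n < 2:
        return 0.0
    mismatches = 0
    for i in range(n):
        # Index of the closest other point (brute force, O(n^2)).
        j = min((k for k in range(n) if k != i),
                key=lambda k: dist(points[i], points[k]))
        if labels[i] != labels[j]:
            mismatches += 1
    return mismatches / n


# Two well-separated groups on a line (toy data, assumed for the example).
points = [(0.0,), (0.1,), (0.3,), (5.0,), (5.1,), (5.3,)]
good = [0, 0, 0, 1, 1, 1]  # partition that respects the gap
bad = [0, 0, 1, 1, 0, 1]   # partition that separates close neighbors

print(nn_disagreement(points, good))  # lower is better
print(nn_disagreement(points, bad))
```

Under this proxy, a partition that keeps neighbors together scores lower than one that separates them, so comparing the score across candidate partitions gives one possible signal of clustering quality. How the paper combines its measures in the meta-learning step is not specified in the abstract and is not modeled here.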

Jong-Seok Lee | Sigurdur Ólafsson
