Automatic cluster number selection by finding density peaks

Clustering is one of the most fundamental techniques in data mining. Although many algorithms have been proposed for the clustering problem, a major remaining difficulty is determining the optimal number of clusters. Several methods attempt to select a reasonable cluster number automatically, such as the Silhouette index, the gap statistic, the Akaike Information Criterion, and the Bayesian Information Criterion. However, these approaches are limited by their model-based nature, which makes them unsuitable for most unsupervised learning settings. In this paper, we propose an efficient method that automatically chooses the optimal number of clusters by finding density peaks; it has lower computational complexity because it does not need to compare multiple clustering results in order to optimize the cluster number. Experimental results show that it outperforms existing methods on six well-known data sets. Furthermore, building on this automatic selection of the cluster number, we propose a fully automatic clustering algorithm. Compared with other algorithms, ours requires no manually specified parameters and is more effective at discovering the structure of data sets.
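The abstract does not spell out the selection rule, but the underlying density-peaks idea (Rodriguez and Laio's "Clustering by fast search and find of density peaks") assigns each point a local density rho and a distance delta to the nearest point of higher density; cluster centers stand out with large gamma = rho * delta. A minimal sketch of one plausible automatic rule, choosing k at the largest relative gap in the sorted gamma sequence, is given below. The kernel width heuristic (`dc` at the 2nd percentile of pairwise distances) and the gap rule are illustrative assumptions, not necessarily the authors' exact method.

```python
import numpy as np

def density_peaks_gamma(X, dc=None):
    """Compute gamma = rho * delta for each point, where rho is a
    Gaussian-kernel local density and delta is the distance to the
    nearest point of strictly higher density."""
    # Pairwise Euclidean distance matrix.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    if dc is None:
        # Heuristic cutoff: ~2% of all pairwise distances (assumption).
        dc = np.percentile(D[np.triu_indices(len(X), 1)], 2)
    # Local density; subtract 1 to remove each point's self-contribution.
    rho = np.exp(-(D / dc) ** 2).sum(axis=1) - 1
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]
        # The globally densest point gets the maximum distance by convention.
        delta[i] = D.max() if higher.size == 0 else D[i, higher].min()
    return rho * delta

def estimate_k(X):
    """Pick k at the largest ratio between consecutive sorted gamma values
    (restricted to the first half to avoid noise in the tiny-gamma tail)."""
    g = np.sort(density_peaks_gamma(X))[::-1]
    ratios = g[:-1] / np.maximum(g[1:], 1e-12)
    return int(np.argmax(ratios[: len(g) // 2])) + 1
```

On two well-separated Gaussian blobs, the two blob centers dominate the gamma ranking and the sharpest drop occurs after the second value, so the gap rule recovers k = 2 without any user-supplied parameter.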
