Research on Sampling Method of CFSFDP Clustering Algorithm and Its Criteria for Determining the Best Sample Size

Clustering by fast search and find of density peaks (CFSFDP) is a novel density-based fast clustering method that has been widely studied and applied in many fields. However, when the data set is very large, the algorithm becomes inefficient because it consumes a great deal of time and storage space. To address this problem, a simple random sampling (SRS) method is proposed to speed up the optimized CFSFDP algorithm on real data with a large sample size. The rate of correct classification of the sample, which we call the sampling accuracy, is defined to measure clustering performance. We first use SRS to generate small samples for cluster analysis. We then explore the relationship between the sampling rate and the sampling accuracy. Finally, to determine the best sample size, that is, one that achieves high sampling accuracy at low computational cost, the mean and standard deviation of the sampling accuracy are adopted as two criteria, and the best sample size is determined from them. A real case study illustrates the implementation and effectiveness of the proposed method.
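The workflow described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `density_peaks` is a simplified cutoff-kernel version of CFSFDP rather than the optimized variant, `sampling_accuracy` assumes reference labels are available and maps each cluster to its majority class, and the thresholds `min_mean` and `max_std` in `best_rate` are hypothetical values, since the abstract does not state the exact decision rule. The synthetic two-blob data is likewise only for demonstration.

```python
import numpy as np


def density_peaks(X, n_clusters, dc):
    """Simplified CFSFDP (assumption: cutoff kernel, centers chosen by rho * delta)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rho = (d < dc).sum(axis=1) - 1            # neighbours within dc, excluding self
    order = np.argsort(-rho, kind="stable")   # indices sorted by decreasing density
    n = len(X)
    delta = np.zeros(n)
    parent = np.full(n, -1)
    for rank in range(1, n):
        i = order[rank]
        denser = order[:rank]                 # points ranked as denser than i
        j = denser[np.argmin(d[i, denser])]   # nearest denser point
        delta[i], parent[i] = d[i, j], j
    gamma = rho * delta
    gamma[order[0]] = np.inf                  # the densest point is always a center
    centers = np.argsort(-gamma, kind="stable")[:n_clusters]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    for i in order:                           # parents are labelled before children
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels


def sampling_accuracy(pred, truth):
    """Fraction of sampled points whose cluster's majority class matches their own."""
    correct = sum(np.bincount(truth[pred == c]).max() for c in np.unique(pred))
    return correct / len(truth)


def accuracy_by_rate(X, truth, rates, n_clusters, dc, repeats=20, seed=0):
    """Mean and standard deviation of the sampling accuracy for each sampling rate."""
    rng = np.random.default_rng(seed)
    stats = {}
    for r in rates:
        m = max(n_clusters, int(round(r * len(X))))
        accs = []
        for _ in range(repeats):
            idx = rng.choice(len(X), size=m, replace=False)  # SRS without replacement
            pred = density_peaks(X[idx], n_clusters, dc)
            accs.append(sampling_accuracy(pred, truth[idx]))
        stats[r] = (float(np.mean(accs)), float(np.std(accs)))
    return stats


def best_rate(stats, min_mean=0.95, max_std=0.02):
    """Smallest rate meeting both criteria (thresholds are hypothetical)."""
    for r in sorted(stats):
        mean, std = stats[r]
        if mean >= min_mean and std <= max_std:
            return r
    return None


# Demonstration on synthetic data: two well-separated Gaussian blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, (200, 2)), rng.normal(10.0, 0.5, (200, 2))])
truth = np.repeat([0, 1], 200)
stats = accuracy_by_rate(X, truth, rates=(0.1, 0.2, 0.4), n_clusters=2, dc=1.0)
for r in sorted(stats):
    print(r, stats[r])
print("selected sampling rate:", best_rate(stats))
```

Choosing the smallest rate that satisfies both criteria reflects the stated goal of high accuracy at low cost: the mean guards against systematically poor samples, while the standard deviation guards against results that vary too much from one random draw to the next.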
