Improved K-means algorithm based on density Canopy

Abstract To improve the accuracy and stability of the K-means algorithm and to solve the problem of determining the most appropriate number of clusters K and the best initial seeds, an improved K-means algorithm based on density Canopy is proposed. First, the density of each sample in the data set, the average distance between samples within a cluster, and the distance between clusters are calculated. The sample with the maximum density is chosen as the first cluster center, and its density cluster is removed from the data set. The weight product is defined as the product of the sample density, the reciprocal of the average distance between samples in the cluster, and the distance between clusters. Each remaining initial seed is the sample with the maximum weight product in the remaining data, and the selection repeats until the data set is empty. The density Canopy thus serves as a preprocessing step for K-means: its result supplies both the number of clusters and the initial cluster centers. Finally, the new algorithm is tested on several well-known data sets from the UCI machine learning repository and on simulated data sets with different proportions of noise samples. The simulation results show that the improved K-means algorithm based on density Canopy achieves better clustering results and is less sensitive to noisy data than the traditional K-means algorithm, the Canopy-based K-means algorithm, the Semi-supervised K-means++ algorithm and the K-means-u* algorithm. Relative to these four algorithms, the clustering accuracy of the proposed algorithm improves on average by 30.7%, 6.1%, 5.3% and 3.7%, respectively, on the UCI data sets, and by 44.3%, 3.6%, 9.6% and 8.9%, respectively, on the simulated data sets with noise. The noise immunity of the new algorithm becomes more pronounced as the noise ratio increases: when the noise ratio reaches 30%, the accuracy is improved by 50% and 6% compared to the traditional K-means algorithm and the Canopy-based K-means algorithm, respectively.
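
The seed-selection procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact formulas, so the neighborhood radius (mean pairwise distance), the density (number of samples within that radius), and the inter-cluster distance (distance to the nearest already chosen center) are all assumptions made here, and the function names `density_canopy_seeds` and `kmeans` are hypothetical.

```python
import numpy as np


def density_canopy_seeds(X):
    """Pick initial centers by density and weight product (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    if n < 2:
        return X.copy()
    # Full pairwise Euclidean distance matrix.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Assumption: the neighborhood radius is the mean pairwise distance.
    radius = dist[np.triu_indices(n, k=1)].mean()

    remaining = np.arange(n)
    centers = []
    while remaining.size > 0:
        d = dist[np.ix_(remaining, remaining)]
        neighbors = d <= radius              # each sample is its own neighbor
        rho = neighbors.sum(axis=1)          # assumed density: neighbor count
        # Average distance from each sample to the other samples in its cluster.
        a = (d * neighbors).sum(axis=1) / np.maximum(rho - 1, 1)
        if not centers:
            # First center: the sample with maximum density.
            pick = int(np.argmax(rho))
        else:
            # Assumed inter-cluster distance: distance to the nearest chosen center.
            s = dist[np.ix_(remaining, centers)].min(axis=1)
            inv_a = np.zeros_like(a)
            inv_a[a > 0] = 1.0 / a[a > 0]    # isolated samples get weight 0
            pick = int(np.argmax(rho * inv_a * s))  # maximum weight product
        centers.append(int(remaining[pick]))
        # Remove the chosen sample's density cluster from the data set.
        remaining = remaining[~neighbors[pick]]
    return X[centers]


def kmeans(X, seeds, n_iter=100):
    """Plain Lloyd iterations started from the given seeds."""
    X = np.asarray(X, dtype=float)
    centers = np.array(seeds, dtype=float)
    for _ in range(n_iter):
        # Assign every sample to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Recompute each center; keep the old one if its cluster is empty.
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

The number of rows returned by `density_canopy_seeds` plays the role of K, and the rows themselves are the initial centers. In this simplified sketch every isolated sample eventually becomes its own seed, so noisy data would need additional handling that the abstract does not describe. With scikit-learn, the same seeds could instead be passed as `KMeans(n_clusters=len(seeds), init=seeds, n_init=1)`.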
