Detection of Outliers and Reduction of their Undesirable Effects for Improving the Accuracy of K-means Clustering Algorithm

Clustering is an unsupervised classification technique that is widely used in data mining. It divides a data set into clusters according to a similarity or dissimilarity criterion, so that the objects assigned to each cluster are more similar to one another than to the objects in other clusters. The k-means algorithm is one of the best-known clustering algorithms and is used in a variety of data mining models. It partitions a set of objects into a given number of clusters. One of its most important weaknesses is its sensitivity to outliers: outliers in the data set pull the computed centers away from the true cluster centers and thereby reduce the accuracy of the clustering. In this paper, we first separate outliers from normal objects using a mechanism based on the dissimilarity of objects. The normal objects are then clustered with the k-means algorithm, and finally each outlier is assigned to its closest cluster. Experimental results show the accuracy and efficiency of the proposed method.
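
The three stages described above (separate outliers, cluster the normal objects, assign each outlier to its closest cluster) can be sketched as follows. This is only an illustrative sketch, not the paper's implementation: the abstract does not specify the dissimilarity mechanism, so the code assumes that an object's dissimilarity is its mean Euclidean distance to all other objects and that an object is an outlier when this score exceeds the mean score by more than z standard deviations. The function names, the parameter z, and the threshold rule are all hypothetical.

```python
import numpy as np


def split_outliers(X, z=2.0):
    """Return a boolean mask marking assumed outliers.

    Assumption (not from the paper): dissimilarity of an object is its mean
    Euclidean distance to all other objects, and an outlier is an object
    whose score exceeds mean(score) + z * std(score).
    """
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    score = dist.sum(axis=1) / (len(X) - 1)  # self-distance is 0
    return score > score.mean() + z * score.std()


def nearest_center(X, centers):
    """Index of the closest center for every row of X."""
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd-style k-means with random initial centers drawn from X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_center(X, centers)
        # Keep the old center if a cluster happens to be empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, nearest_center(X, centers)


def cluster_with_outlier_handling(X, k, z=2.0, seed=0):
    """Fit centers on normal objects only, then label every object."""
    is_outlier = split_outliers(X, z)
    centers, _ = kmeans(X[~is_outlier], k, seed=seed)
    labels = nearest_center(X, centers)  # outliers get their closest cluster
    return centers, labels, is_outlier
```

Because the centers are fitted only on the objects that pass the outlier test, a distant object cannot drag a center toward itself; it still receives a label at the end, taken from whichever fitted center is nearest.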
