Enhanced k-means Clustering Algorithm

Data clustering is an unsupervised classification method that aims to create groups of objects, or clusters, such that objects in the same cluster are very similar and objects in different clusters are quite distinct. Although k-means is very popular for general clustering, it suffers from several disadvantages: (1) its performance depends strongly on the initial cluster centers, (2) the number of clusters must be known and fixed in advance, and (3) the algorithm is prone to the dead-unit problem, which produces empty clusters. Random initialization generally leads k-means to converge to local minima, i.e., to unacceptable clustering results. In this thesis, a method based on rough set theory concepts and reverse nearest neighbor (RNN) search is proposed to find appropriate initial centers for the k-means clustering problem, and the complexity of the proposed method is analyzed. In addition, a method is described for determining the number of clusters in a dataset. Experimental results show the accuracy and effectiveness of the proposed methods.

Keywords: k-means, rough set theory, cohesion degree, RNN degree
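
The abstract does not spell out the initialization procedure, so the sketch below only illustrates the general idea of seeding k-means with reverse-nearest-neighbor information: a point with a high RNN degree (one that many other points name as their nearest neighbor) tends to lie inside a dense region and is a plausible initial center. The function names (rnn_degree, rnn_seeds, kmeans) and the "far and dense" selection heuristic are assumptions made for illustration; the method actually proposed in the thesis also uses a rough-set cohesion degree and may differ substantially.

```python
import numpy as np

def rnn_degree(X):
    """Reverse-nearest-neighbor degree: for each point, count how many
    other points have it as their single nearest neighbor."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    nn = d.argmin(axis=1)                # index of each point's nearest neighbor
    return np.bincount(nn, minlength=len(X))

def rnn_seeds(X, k):
    """Illustrative seeding (an assumption, not the thesis procedure):
    prefer points with high RNN degree that are far from already chosen seeds."""
    deg = rnn_degree(X)
    cand = np.flatnonzero(deg > 0)       # points that are someone's nearest neighbor
    if len(cand) < k:
        cand = np.arange(len(X))
    seeds = [cand[np.argmax(deg[cand])]] # densest candidate first
    while len(seeds) < k:
        # distance of every candidate to its closest already-chosen seed
        d = np.linalg.norm(X[cand][:, None, :] - X[seeds][None, :, :], axis=-1).min(axis=1)
        # score favors candidates that are both far from existing seeds and dense
        seeds.append(cand[np.argmax(d * (1 + deg[cand]))])
    return X[np.array(seeds)].astype(float)

def kmeans(X, k, n_iter=100):
    """Standard Lloyd iterations started from the RNN-based seeds."""
    centers = rnn_seeds(X, k)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # recompute each center; keep the old one if its cluster became empty
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1).argmin(axis=1)
    return labels, centers

# Small synthetic demo: three well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(60, 2))
               for loc in ([0.0, 0.0], [3.0, 3.0], [0.0, 3.0])])
labels, centers = kmeans(X, k=3)
print(np.round(centers, 2))
```

The seeding step is deterministic for a fixed dataset, which is the practical point of such initialization schemes: unlike random seeding, repeated runs do not wander between different local minima.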
