Enhanced K-Means Clustering Algorithm Using Red Black Tree and Min-Heap

  Abstract: Fast, high-quality clustering is one of the most important tasks in modern information processing, in which people rely heavily on search engines such as Google, Yahoo, and Bing. Given the huge amount of available data and the aim of creating better-quality clusters, scores of algorithms with quality-complexity trade-offs have been proposed. However, the k-means algorithm, proposed in the late 1970s, still holds a respectable position among clustering algorithms and is considered one of the most fundamental algorithms of data mining. It is an iterative algorithm: in each iteration, it computes the distance between each data object and the centroid of each cluster. Given the size of modern databases, this step alone becomes very expensive. In this paper, we propose an improved version of the k-means algorithm that remedies this problem. The algorithm employs two data structures, namely a red-black tree and a min-heap, both of which are readily available in modern programming languages. The red-black tree is available as map in C++ and TreeMap in Java, and the min-heap is available as the priority queue in the C++ Standard Template Library. The implementation of our algorithm is therefore as simple as that of the traditional algorithm. We have carried out extensive experiments, and the results establish the superiority of our version of the k-means algorithm over the traditional one.
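
For orientation, the per-iteration cost that the abstract refers to can be sketched as follows. This is a minimal illustration of the traditional k-means iteration only, not the enhanced algorithm, whose use of the red-black tree and min-heap is not detailed in the abstract. The function names (assign_to_nearest, update_centroids) and the sample data are hypothetical and chosen purely for illustration.

```python
import math

def assign_to_nearest(points, centroids):
    """Assign each point to its nearest centroid (traditional k-means step).

    Returns the list of cluster labels and the number of distance
    evaluations, which is len(points) * len(centroids) per iteration.
    """
    labels = []
    evaluations = 0
    for p in points:
        best_index, best_dist = -1, math.inf
        for j, c in enumerate(centroids):
            d = math.dist(p, c)  # Euclidean distance (Python 3.8+)
            evaluations += 1
            if d < best_dist:
                best_index, best_dist = j, d
        labels.append(best_index)
    return labels, evaluations

def update_centroids(points, labels, centroids):
    """Recompute each centroid as the mean of its assigned points.

    A cluster with no points keeps its previous centroid (an assumption
    made here for simplicity).
    """
    dims = len(centroids[0])
    sums = [[0.0] * dims for _ in centroids]
    counts = [0] * len(centroids)
    for p, label in zip(points, labels):
        counts[label] += 1
        for axis in range(dims):
            sums[label][axis] += p[axis]
    return [
        tuple(s / counts[j] for s in sums[j]) if counts[j] else tuple(centroids[j])
        for j in range(len(centroids))
    ]

# Hypothetical toy data: two well-separated groups in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

labels, evaluations = assign_to_nearest(points, centroids)
new_centroids = update_centroids(points, labels, centroids)
```

With n data objects and k clusters, the assignment step performs n * k distance evaluations in every iteration, which is the cost the proposed algorithm aims to reduce.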
