Adaptive Initialization Method Based on Spatial Local Information for K-Means Algorithm

The k-means algorithm is a widely used clustering algorithm in the data mining and machine learning communities. However, the initial guess of cluster centers strongly affects the clustering result: an improper initialization can prevent the algorithm from reaching a desirable clustering. How to choose suitable initial centers is therefore an important research issue for the k-means algorithm. In this paper, we propose an adaptive initialization framework based on spatial local information (AIF-SLI), which takes advantage of the local density of the data distribution. As it is difficult to estimate density exactly, we develop two approximate estimates: density by k-nearest neighborhoods (k-NN) and density by ε-neighborhoods (ε-Ball), leading to two implementations of the proposed framework. Our empirical study on more than 20 datasets shows promising performance and indicates that the proposed framework has several advantages: (1) it can find reasonable candidate initial centers effectively; (2) it significantly reduces the number of iterations required by k-means-type methods; (3) it is robust to outliers; and (4) it is easy to implement.
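The abstract does not give the exact AIF-SLI procedure, but the general idea of density-aware seeding it describes can be sketched as follows. This is an illustrative assumption, not the authors' algorithm: `knn_density` approximates local density as the inverse of the mean distance to the k nearest neighbors (the k-NN variant mentioned above), and `density_init` greedily picks high-density points that are also far from the centers chosen so far, so that isolated outliers (low density) are unlikely to be selected.

```python
import numpy as np

def knn_density(X, k=5):
    """Approximate local density of each point as the inverse of its
    mean distance to its k nearest neighbors (illustrative sketch)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    knn = np.sort(d, axis=1)[:, 1:k + 1]  # columns 1..k skip the zero self-distance
    return 1.0 / (knn.mean(axis=1) + 1e-12)

def density_init(X, n_clusters, k=5):
    """Pick initial centers greedily: start from the densest point, then
    repeatedly choose the point maximizing (distance to nearest chosen
    center) * (local density). The product is a hypothetical scoring
    rule chosen so that far-away but sparse outliers score poorly."""
    dens = knn_density(X, k)
    centers = [X[np.argmax(dens)]]
    for _ in range(n_clusters - 1):
        dist = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2),
            axis=1)
        centers.append(X[np.argmax(dist * dens)])
    return np.array(centers)
```

On two well-separated blobs plus a distant outlier, the density weighting steers the second seed into the other blob rather than onto the outlier, which is the robustness property claimed in point (3) above.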
