An Improved K-means Algorithm Based on MapReduce and Grid

The traditional K-means clustering algorithm makes it difficult to choose the number of clusters K in advance, and it selects the initial cluster centers randomly, which makes the clustering results very unstable. The algorithm is also susceptible to noise points. To address these problems, we improve the traditional K-means algorithm. The improved method divides the data space into grids of equal size, assigns each data point to the corresponding grid according to its attribute values, and counts the number of data points in each grid. The M (M > K) grids containing the most data points are selected, and the center point of each is calculated. These M center points are then used as input data, and the value of K is determined from the resulting clustering. Among the M points, the K points farthest from each other are chosen as the initial cluster centers of the K-means algorithm, with the constraint that the center of the grid containing the most data points must be among them. If the number of data points in a grid is less than a threshold, those points are treated as noise and removed. To allow the improved algorithm to handle large data sets, we parallelize it with the MapReduce framework. Theoretical analysis and experimental results show that, compared with the traditional K-means clustering algorithm, the improved algorithm produces higher-quality results, requires fewer iterations, and is more stable. The parallelized algorithm processes data very efficiently and shows good scalability and speedup.
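The grid-based initialization described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `grid_seeds` and the parameters `cell_size` and `noise_threshold` are hypothetical, K is taken as an input rather than determined from the M center points, and "K points farthest from each other" is approximated by a greedy farthest-point selection that starts from the densest grid. The MapReduce parallelization is not shown.

```python
from collections import defaultdict
from math import dist, floor


def grid_seeds(points, cell_size, m, k, noise_threshold):
    """Return (seeds, kept_points) using a grid-density heuristic.

    Illustrative only: cell_size and noise_threshold are assumed parameters.
    """
    # Assign each point to a grid cell keyed by its integer cell coordinates.
    cells = defaultdict(list)
    for p in points:
        cells[tuple(floor(x / cell_size) for x in p)].append(p)

    # Cells holding fewer points than the threshold are treated as noise.
    dense = [pts for pts in cells.values() if len(pts) >= noise_threshold]
    kept = [p for pts in dense for p in pts]

    # Keep the M most populated cells and compute the mean point of each.
    dense.sort(key=len, reverse=True)
    centers = [tuple(sum(c) / len(pts) for c in zip(*pts)) for pts in dense[:m]]
    if len(centers) < k:
        raise ValueError("fewer than k dense cells; adjust cell_size or threshold")

    # Start from the densest cell (index 0), then greedily add the candidate
    # whose nearest already-chosen seed is farthest away.
    chosen = [0]
    while len(chosen) < k:
        rest = [i for i in range(len(centers)) if i not in chosen]
        chosen.append(
            max(rest, key=lambda i: min(dist(centers[i], centers[j]) for j in chosen))
        )
    return [centers[i] for i in chosen], kept
```

The returned seeds would then be passed to an ordinary K-means run over the kept points in place of randomly chosen initial centers.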
