Single-pass and linear-time k-means clustering based on MapReduce

In recent years, k-means has been fitted into the MapReduce framework and hence it has become a very effective solution for clustering very large datasets. However, k-means is not inherently suitable for execution in MapReduce. The iterative nature of k-means cannot be modeled in MapReduce and hence for each iteration of k-means an independent MapReduce job must be executed and this results in high I/O overhead because in each iteration the whole dataset must be read and written to slow disks. We have proposed a single-pass solution based on MapReduce called mrk-means which uses the reclustering technique. In contrast to available MapReduce-based k-means implementations, mrk-means just reads the dataset once and hence it is several times faster. The time complexity of mrk-means is linear which is lower than the iterative k-means. Due to usage of k-means++ seeding algorithm, mrk-means results in clusters with higher quality, too. Theoretically, the results of mrk-means are O ( log 2 k ) - competitive to optimal clustering in the worst case, considering k as the number of clusters. During our experiments which were done on a cluster of 40 machines running the Hadoop framework, mrk-means showed both faster execution times, and higher quality of clustering results compared to available MapReduce-based and stream-based k-means variants. Highlightsmrk-means is a novel clustering algorithm which is based on MapReduce.mrk-means is single-pass and linear-time.mrk-means results in clusters that are O ( log 2 k ) - competitive to optimal solution.mrk-means is both faster and more accurate than Apache Mahout and GraphLab k-means.

[1]  Sudipto Guha,et al.  Streaming-data algorithms for high-quality clustering , 2002, Proceedings 18th International Conference on Data Engineering.

[2]  Anil K. Jain Data clustering: 50 years beyond K-means , 2010, Pattern Recognit. Lett..

[3]  George Kesidis,et al.  Emergent unsupervised clustering paradigms with potential application to bioinformatics. , 2008, Frontiers in bioscience : a journal and virtual library.

[4]  Sanjay Ghemawat,et al.  MapReduce: a flexible data processing tool , 2010, CACM.

[5]  Michael D. Ernst,et al.  HaLoop , 2010, Proc. VLDB Endow..

[6]  Saeed Jalili,et al.  Fast data-oriented microaggregation algorithm for large numerical datasets , 2014, Knowl. Based Syst..

[7]  Sergei Vassilvitskii,et al.  Scalable K-Means++ , 2012, Proc. VLDB Endow..

[8]  Kevin Skadron,et al.  Scalable parallel programming , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).

[9]  H. Edelsbrunner,et al.  Efficient algorithms for agglomerative hierarchical clustering methods , 1984 .

[10]  Saeed Shahrivari,et al.  High performance parallel $$k$$k-means clustering for disk-resident datasets on multi-core CPUs , 2014, The Journal of Supercomputing.

[11]  Athanasios V. Vasilakos,et al.  Big data: From beginning to future , 2016, Int. J. Inf. Manag..

[12]  C. Lynch Big data: How do your data grow? , 2008, Nature.

[13]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[14]  Nir Ailon,et al.  Streaming k-means approximation , 2009, NIPS.

[15]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[16]  Hillol Kargupta,et al.  Approximate Distributed K-Means Clustering over a Peer-to-Peer Network , 2009, IEEE Transactions on Knowledge and Data Engineering.

[17]  Joseph M. Hellerstein,et al.  Distributed GraphLab: A Framework for Machine Learning in the Cloud , 2012, Proc. VLDB Endow..

[18]  Sariel Har-Peled,et al.  On coresets for k-means and k-median clustering , 2004, STOC '04.

[19]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[20]  Pairote Sattayatham,et al.  Weighted K-Means for Density-Biased Clustering , 2005, DaWaK.

[21]  Tom White,et al.  Hadoop: The Definitive Guide , 2009 .

[22]  Geoffrey C. Fox,et al.  Twister: a runtime for iterative MapReduce , 2010, HPDC '10.

[23]  Kunle Olukotun,et al.  Map-Reduce for Machine Learning on Multicore , 2006, NIPS.

[24]  Jiali Mao,et al.  The Study of Parallel K-Means Algorithm , 2006, 2006 6th World Congress on Intelligent Control and Automation.

[25]  Bo Li,et al.  Parallel K-Means Clustering of Remote Sensing Images Based on MapReduce , 2010, WISM.

[26]  Kilian Stoffel,et al.  Parallel k/h-Means Clustering for Large Data Sets , 1999, Euro-Par.

[27]  Jeffrey D. Ullman,et al.  Map-reduce extensions and recursive queries , 2011, EDBT/ICDT '11.

[28]  ZhangHui,et al.  Image segmentation evaluation , 2008 .

[29]  Ruoming Jin,et al.  Fast and exact out-of-core and distributed k-means clustering , 2006, Knowledge and Information Systems.

[30]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[31]  Qing He,et al.  Parallel K-Means Clustering Based on MapReduce , 2009, CloudCom.

[32]  Carlos Ordonez,et al.  Clustering binary data streams with K-means , 2003, DMKD '03.

[33]  Hui Zhang,et al.  Image segmentation evaluation: A survey of unsupervised methods , 2008, Comput. Vis. Image Underst..