Paper information - Single-pass and linear-time k-means clustering based on MapReduce


In recent years, k-means has been adapted to the MapReduce framework, making it an effective solution for clustering very large datasets. However, k-means is not inherently suited to MapReduce. Its iterative nature cannot be modeled directly in MapReduce, so each iteration must run as an independent MapReduce job. This incurs high I/O overhead, because every iteration reads the whole dataset from, and writes it to, slow disks. We propose mrk-means, a single-pass MapReduce-based solution that uses the reclustering technique. In contrast to available MapReduce-based k-means implementations, mrk-means reads the dataset only once and is therefore several times faster. Its time complexity is linear, which is lower than that of iterative k-means. Because it uses the k-means++ seeding algorithm, mrk-means also produces higher-quality clusters. Theoretically, the results of mrk-means are O(log² k)-competitive with the optimal clustering in the worst case, where k is the number of clusters. In our experiments on a cluster of 40 machines running the Hadoop framework, mrk-means showed both faster execution times and higher-quality clustering results than available MapReduce-based and stream-based k-means variants.

Highlights
- mrk-means is a novel clustering algorithm based on MapReduce.
- mrk-means is single-pass and linear-time.
- mrk-means produces clusters that are O(log² k)-competitive with the optimal solution.
- mrk-means is both faster and more accurate than Apache Mahout and GraphLab k-means.
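The abstract does not spell out the algorithm, so the following is only a minimal sketch of one common reading of single-pass reclustering, not the authors' implementation. It assumes that each chunk of the dataset is clustered once with k-means++ seeding (the "map" side), that every chunk emits its centers weighted by cluster size, and that these weighted centers are clustered again to give the final k centers (the "reduce" side). All function names (`sq_dist`, `kmeanspp_seed`, `weighted_kmeans`, `recluster`) and the parameter `iters` are hypothetical, and the chunks are processed sequentially here rather than by parallel mappers.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seed(points, weights, k, rng):
    """Weighted k-means++ seeding: sample centers proportionally to w * D^2."""
    centers = [rng.choices(points, weights=weights, k=1)[0]]
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        probs = [w * d for w, d in zip(weights, d2)]
        if sum(probs) <= 0:  # every weighted point already sits on a center
            break
        c = rng.choices(points, weights=probs, k=1)[0]
        centers.append(c)
        d2 = [min(d, sq_dist(p, c)) for p, d in zip(points, d2)]
    return centers


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def weighted_kmeans(points, weights, k, rng, iters=10):
    """k-means++ seeding followed by a few weighted Lloyd iterations.

    Returns the centers and the total weight assigned to each center.
    """
    centers = kmeanspp_seed(points, weights, k, rng)
    dim = len(points[0])
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(len(centers)):
            members = [(p, w) for p, w, l in zip(points, weights, labels) if l == j]
            total = sum(w for _, w in members)
            if total > 0:  # keep the old center if its cluster is empty
                centers[j] = tuple(
                    sum(w * p[d] for p, w in members) / total for d in range(dim)
                )
    counts = [0.0] * len(centers)
    for l, w in zip(assign(points, centers), weights):
        counts[l] += w
    return centers, counts


def recluster(chunks, k, seed=0):
    """Single pass over the data: cluster each chunk, then cluster the centers."""
    rng = random.Random(seed)
    inter_points, inter_weights = [], []
    for chunk in chunks:  # "map" side: each chunk is read exactly once
        centers, counts = weighted_kmeans(chunk, [1.0] * len(chunk), k, rng)
        inter_points.extend(centers)
        inter_weights.extend(counts)
    # "reduce" side: recluster the small set of weighted intermediate centers
    final_centers, _ = weighted_kmeans(inter_points, inter_weights, k, rng)
    return final_centers
```

In this sketch the second clustering step works on only a few weighted centers per chunk, so the raw data never needs to be read a second time. That is the property the abstract credits for the reduced I/O.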

Saeed Jalili | Saeed Shahrivari
