Paper information - Mining on the Cloud - K-means with MapReduce

Mining on the Cloud - K-means with MapReduce

The Apache Hadoop software library is a framework for the distributed processing of large data sets; HDFS is a distributed file system that provides high-throughput data access for data-driven applications; and MapReduce is a software framework for distributed computation over large data sets. Huge collections of raw data require a fast and accurate mining process in order to extract useful knowledge. One of the most popular data mining techniques is the K-means clustering algorithm. In this paper, we develop a distributed version of the K-means algorithm using the MapReduce framework on the Hadoop Distributed File System. The theoretical and experimental results demonstrate the efficiency of the technique.

M. Tahar Kechadi | Ilias K. Savvas
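
The abstract describes a distributed K-means built on MapReduce but does not spell out the decomposition. The following is a minimal single-process sketch of how one K-means iteration is commonly phrased as a map step and a reduce step. It is an illustration under stated assumptions, not the authors' implementation: the function names (`map_point`, `reduce_cluster`, `kmeans_iteration`, `kmeans`), the Euclidean distance, the convergence tolerance, and the in-memory dictionary standing in for Hadoop's shuffle phase are all hypothetical choices made for this example.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_point(point, centroids):
    """Map step (assumed form): emit (nearest centroid index, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def reduce_cluster(values):
    """Reduce step (assumed form): average the points assigned to one centroid."""
    total = [0.0] * len(values[0][0])
    count = 0
    for point, n in values:
        total = [t + x for t, x in zip(total, point)]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centroids):
    """Run one map/shuffle/reduce round and return the updated centroids."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for p in points:
        key, value = map_point(p, centroids)
        groups[key].append(value)
    # A centroid with no assigned points is left unchanged.
    return [
        reduce_cluster(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=100, tol=1e-9):
    """Repeat iterations until no centroid moves more than tol."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if all(dist(a, b) <= tol for a, b in zip(updated, centroids)):
            return updated
        centroids = updated
    return centroids


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans(pts, [(0.0, 0.0), (10.0, 10.0)]))
```

In a real Hadoop job, each iteration would typically be a separate MapReduce job that reads the current centroids from shared storage; the loop in `kmeans` only mimics that driver logic locally.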
