Paper information - Improved Map Reduce K Mean Clustering Algorithm for Hadoop Architecture

A cluster is a group of data members with similar characteristics. Data mining is the process of establishing relationships in, or extracting information from, raw data by performing operations on the data set, such as clustering. Data collected in practical scenarios is usually random and unstructured, so unstructured data sets must be analyzed to derive meaningful information. This is where unsupervised algorithms come in: they process unstructured or even semi-structured data sets. K-Means clustering is one such technique, used to give structure to unstructured data so that meaningful information can be extracted. This paper discusses the implementation of the K-Means clustering algorithm in a distributed environment using Apache Hadoop. The key to the implementation is the design of the Mapper and Reducer routines, which is discussed in the later part of the paper. The steps involved in executing the K-Means algorithm are also described, based on a small-scale implementation on an experimental setup, to serve as a guide for practical implementations.
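
As a rough illustration of how K-Means can be split into Mapper and Reducer routines, the following is a minimal single-process sketch in Python. It is not the authors' Hadoop code: the function names (`mapper`, `reducer`, `kmeans_iteration`, `kmeans`), the in-memory dictionary standing in for the shuffle phase, and the stopping rule are all assumptions made for illustration.

```python
from collections import defaultdict
from math import dist


def mapper(point, centroids):
    """Map step (assumed design): emit (index of nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reducer(points):
    """Reduce step (assumed design): average all points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One map -> shuffle -> reduce pass; returns the updated centroids."""
    groups = defaultdict(list)  # stands in for Hadoop's shuffle/sort phase
    for point in points:
        key, value = mapper(point, centroids)
        groups[key].append(value)
    # A centroid that received no points is left unchanged.
    return [
        reducer(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=20):
    """Repeat iterations until the centroids stop changing or max_iter is reached."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

For example, `kmeans([(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 8.0)], [(0.0, 0.0), (10.0, 10.0)])` assigns the first two points to one centroid and the last two to the other. In an actual Hadoop job each iteration would typically run as a separate MapReduce job, with the current centroids distributed to every mapper.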

Vivek Badhe | Shweta Mishra