Mining on the Cloud - K-means with MapReduce

The Apache Hadoop software library is a framework for the distributed processing of large data sets. Within it, the Hadoop Distributed File System (HDFS) provides high-throughput access to application data, and MapReduce is a software framework for distributed computation over large data sets. Huge collections of raw data require a fast and accurate mining process in order to extract useful knowledge. One of the most popular data mining techniques is the K-means clustering algorithm. In this paper, we develop a distributed version of the K-means algorithm using the MapReduce framework on the Hadoop Distributed File System. Theoretical and experimental results demonstrate the efficiency of the technique.
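
To make the idea concrete, the following is a minimal single-process sketch of how one K-means iteration can be phrased as a map step and a reduce step. It is an illustration under stated assumptions, not the implementation developed in this paper: it runs in plain Python rather than on Hadoop, and the function names (kmeans_map, kmeans_reduce, kmeans_iteration, kmeans), the Euclidean distance measure, and the convergence tolerance are all hypothetical choices made for the example.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_map(point, centroids):
    """Map step: emit (index of the nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce step: average all points that share one centroid index."""
    total = None
    count = 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [a + b for a, b in zip(total, point)]
        count += n
    return tuple(x / count for x in total)


def kmeans_iteration(points, centroids):
    """Simulate one MapReduce job: map, group by key (shuffle), reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centroids)
        groups[key].append(value)
    # Assumption: a centroid that attracts no points is left unchanged.
    return [
        kmeans_reduce(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat iterations until no centroid moves more than tol."""
    for _ in range(max_iter):
        new_centroids = kmeans_iteration(points, centroids)
        if all(dist(a, b) <= tol for a, b in zip(new_centroids, centroids)):
            return new_centroids
        centroids = new_centroids
    return centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(data, [(0, 0), (10, 10)]))
```

In a distributed setting, each map call would run on the node holding that block of points, and only the per-centroid partial sums and counts would cross the network, which is what makes the formulation attractive for large data sets.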