A Comparative Study of Clustering Algorithms Using MapReduce in Hadoop

Hadoop provides a distributed file system and an open-source implementation of MapReduce for processing big data. Facebook, Yahoo, Google and others make use of Hadoop to process more than 15 terabytes of new data per day. Data clustering is a part of machine learning and is widely applicable in industry and in fields such as image processing, recommendation systems and text analytics. This paper describes three clustering algorithms, K-means, canopy clustering and fuzzy K-means, each implemented with both a MapReduce and a sequential approach. High-volume data cannot fit into memory for clustering, and the MapReduce paradigm addresses many problems of this kind by modelling algorithms as map and reduce steps. This is why the MapReduce paradigm has gained popularity for clustering big data. Such algorithms work on data in partitioned form, so they must take the distributed nature of the partitions into account and be modelled accordingly. Apache Mahout, an open-source implementation of such algorithms developed by a team of active contributors, is used in various organizations.
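
To illustrate how a clustering algorithm can be modelled as map and reduce steps over partitioned data, the following is a minimal single-machine sketch of one K-means iteration in Python. It is not Mahout's or Hadoop's code: the function names (map_phase, reduce_phase, kmeans_iteration) are hypothetical, the partitions are plain in-memory lists standing in for distributed data blocks, and a real job would run the map step on separate nodes and shuffle the pairs by key before reducing.

```python
from collections import defaultdict
from math import dist


def map_phase(partition, centroids):
    """Map step: for each point in one partition, emit (nearest centroid index, (point, 1))."""
    for point in partition:
        nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
        yield nearest, (point, 1)


def reduce_phase(pairs, centroids):
    """Reduce step: sum the points assigned to each centroid and return the new means."""
    sums = {}
    counts = defaultdict(int)
    for idx, (point, n) in pairs:
        if idx not in sums:
            sums[idx] = [0.0] * len(point)
        sums[idx] = [s + x for s, x in zip(sums[idx], point)]
        counts[idx] += n
    # A centroid that received no points keeps its previous position.
    new_centroids = list(centroids)
    for idx, total in sums.items():
        new_centroids[idx] = tuple(s / counts[idx] for s in total)
    return new_centroids


def kmeans_iteration(partitions, centroids):
    """One iteration: map every partition independently, then reduce all emitted pairs."""
    pairs = [pair for part in partitions for pair in map_phase(part, centroids)]
    return reduce_phase(pairs, centroids)
```

For example, with two partitions `[[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]` and initial centroids `[(1.0, 1.0), (9.0, 9.0)]`, one call to `kmeans_iteration` returns `[(0.0, 1.0), (10.0, 11.0)]`. Only the current centroids need to be shared with every mapper, which is why the full data set never has to fit into a single machine's memory.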