A comparative study of various clustering techniques on big data sets using Apache Mahout

Clustering algorithms have materialized as an unconventional tool to precisely examine the immense volume of data produced by present applications. In specific, their main objective is to classify data into clusters such that objects are grouped in the same cluster when they are similar rendering to particular metrics and dissimilar to objects of other groups. From the machine learning perspective clustering can be viewed as unsupervised learning of concepts. Hadoop is a distributed file system and an open-source implementation of MapReduce dealing with big data. Apache Mahout clustering algorithms are implemented on top of Hadoop using MapReduce paradigm. In this paper three clustering algorithms are described: K-means, Fuzzy K-Means (FKM) and Canopy clustering implemented by using Apache Mahout as well as providing a comparison. In addition, we underlined the clustering algorithms that are the preeminent performing for big data.