Paper information - A comparative study of various clustering techniques on big data sets using Apache Mahout


Clustering algorithms have emerged as an alternative tool for accurately analyzing the immense volume of data produced by modern applications. Specifically, their main objective is to partition data into clusters such that objects in the same cluster are similar according to particular metrics and dissimilar to objects in other clusters. From a machine learning perspective, clustering can be viewed as unsupervised learning of concepts. Hadoop provides a distributed file system and an open-source implementation of MapReduce for processing big data. Apache Mahout's clustering algorithms are implemented on top of Hadoop using the MapReduce paradigm. This paper describes and compares three clustering algorithms implemented with Apache Mahout: K-means, Fuzzy K-Means (FKM), and Canopy clustering. In addition, we highlight the clustering algorithms that perform best on big data.
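To make the clustering objective concrete, the following is a minimal single-machine sketch of K-means (Lloyd's iteration) in Python. It is not Mahout's MapReduce implementation and does not reproduce the paper's experimental setup; the function name `kmeans`, its parameters, and the choice of Euclidean distance as the similarity metric are illustrative assumptions.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def kmeans(points, centroids, max_iters=100):
    """Illustrative K-means on a list of equal-length numeric tuples.

    `centroids` holds the caller-supplied initial cluster centers
    (hypothetical interface, chosen here to keep the sketch deterministic).
    Returns the final centroids and the cluster index of each point.
    """
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                mean = tuple(sum(xs) / len(members) for xs in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(old)
        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment
```

Calling `kmeans(points, initial_centroids)` on two well-separated groups of points should place each group in its own cluster. Fuzzy K-Means differs in that each point receives a degree of membership in every cluster rather than a single hard assignment, and Canopy clustering is commonly used as a cheap pre-clustering pass to pick initial centers.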
