Analysis of Mahout Big Data Clustering Algorithms

Log data generated by source or communicating devices is voluminous; to analyze it, the data must first be grouped into clusters, which then serve as the basis for analytics. Such analytics help identify business patterns and customer behavior. Because analyzing data at this scale is a major task, distributed computing on the Hadoop platform is used together with the Mahout machine learning library. The data is vectorized with the TF-IDF weighting technique, and clusters are formed with clustering algorithms for analysis. The K-means, fuzzy K-means, LDA, and spectral clustering algorithms in Mahout are applied and compared on the basis of execution time, number of clusters, and whether clusters are created statically or dynamically.
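The vectorize-then-cluster pipeline described above can be illustrated with a minimal single-machine sketch. It is a pure Python stand-in and does not use Mahout or Hadoop. The sample log lines, the function names (`tfidf_vectors`, `kmeans`), the whitespace tokenizer, the unsmoothed `tf * log(N / df)` weighting, and the choice of the first k vectors as initial centroids are all assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using tf * log(N / df) weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            [(counts[t] / len(tokens)) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=10):
    """Plain K-means; the first k vectors seed the centroids (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


# Hypothetical log lines, invented for this example.
logs = [
    "error disk full",
    "user login success",
    "error disk failure",
    "user login failed",
]
vocab, vectors = tfidf_vectors(logs)
labels = kmeans(vectors, k=2)
print(labels)
```

Here the number of clusters is fixed in advance through `k`, which corresponds to the static cluster creation the abstract contrasts with dynamic creation. In a distributed setting the assignment and update steps would be spread across many machines rather than run in one loop.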
