A MapReduce-based artificial bee colony for large-scale data clustering

Abstract Technological progress has driven rapid growth in the volume of digital data, so sound data analysis is necessary for making better decisions. Clustering is one of the most important tasks in data analysis, but clustering very large datasets remains a primary concern: it requires computational models that can cluster huge volumes of data within a reasonable amount of time. MapReduce is a powerful programming model, with an associated implementation, for processing large datasets with a parallel, distributed algorithm on a computing cluster. In this paper, a MapReduce-based artificial bee colony algorithm called MR-ABC is proposed for data clustering. The ABC is implemented on the MapReduce model in the Hadoop framework and is used to optimize the assignment of large numbers of data instances to clusters, with the objective of minimizing the sum of squared Euclidean distances between each data instance and the centroid of the cluster to which it belongs. The experimental results demonstrate that the proposed algorithm is well suited to massive amounts of data while maintaining the quality of the clustering results.
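
The objective stated in the abstract can be illustrated with a small single-machine sketch written in a map/reduce style. This is not the paper's Hadoop implementation: the function names (`map_point`, `reduce_costs`, `clustering_cost`) and the toy data are assumptions made for illustration only. The map step emits, for each data instance, the index of its nearest centroid and the squared Euclidean distance to it; the reduce step sums those distances per cluster. In an ABC setting, a candidate solution would presumably be a set of centroids scored by this total.

```python
from collections import defaultdict


def squared_distance(point, centroid):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - c) ** 2 for p, c in zip(point, centroid))


def map_point(point, centroids):
    """Map step (hypothetical): emit (nearest centroid index, squared distance)."""
    distances = [squared_distance(point, c) for c in centroids]
    nearest = min(range(len(centroids)), key=distances.__getitem__)
    return nearest, distances[nearest]


def reduce_costs(pairs):
    """Reduce step (hypothetical): sum the squared distances per cluster index."""
    per_cluster = defaultdict(float)
    for index, distance in pairs:
        per_cluster[index] += distance
    return dict(per_cluster)


def clustering_cost(points, centroids):
    """Total sum of squared distances for one candidate set of centroids."""
    pairs = (map_point(p, centroids) for p in points)
    return sum(reduce_costs(pairs).values())


if __name__ == "__main__":
    # Toy data, assumed for illustration: two well-separated groups.
    points = [(0, 0), (0, 2), (10, 10), (10, 12)]
    centroids = [(0, 1), (10, 11)]
    print(clustering_cost(points, centroids))
```

Because each point's contribution depends only on the point and the current centroids, the map step can run independently on separate partitions of the data, which is what makes the objective a natural fit for a MapReduce-style evaluation.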
