A Parallel K-Medoids Algorithm for Clustering based on MapReduce

Clustering, the grouping of data into clusters or categories, is one of the most important machine learning techniques. Several good algorithms and techniques exist for clustering small- to medium-scale data. In the era of Big Data, applications are large-scale and data-intensive, and the log events they produce have grown significantly in volume, variety, and velocity. This makes clustering huge amounts of data more challenging and limits what existing techniques can handle. In this paper, we present a parallel K-Medoids clustering algorithm based on the MapReduce paradigm that can cluster large-scale data. We have kept our solution simple and practical so that it can handle data of high volume, variety, and velocity. A key distinguishing feature of the proposed algorithm is that it achieves parallelism independent of the number k of clusters to be formed, unlike other related approaches. We have tested our algorithm on large amounts of data and on a real-life case study.
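To make the general idea concrete, the following is a minimal single-process sketch of how one K-Medoids iteration can be phrased as a map step and a reduce step. It is an illustration under stated assumptions, not the algorithm proposed in this paper: the function names (`map_assign`, `reduce_medoid`, `k_medoids`), the Euclidean distance, and the in-memory "shuffle" are all hypothetical choices. In particular, this naive formulation runs one reduce task per cluster, so its reduce-side parallelism is bounded by k, which is exactly the limitation the proposed approach is said to avoid.

```python
from collections import defaultdict


def distance(a, b):
    """Euclidean distance between two points given as tuples (an assumption)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def map_assign(chunk, medoids):
    """Map step: for each point in one input split, emit (nearest medoid index, point)."""
    for p in chunk:
        nearest = min(range(len(medoids)), key=lambda i: distance(p, medoids[i]))
        yield nearest, p


def reduce_medoid(points):
    """Reduce step: choose the member with the smallest total distance to the others."""
    return min(points, key=lambda c: sum(distance(c, p) for p in points))


def k_medoids(chunks, medoids, max_iter=20):
    """Iterate map and reduce until the medoids stop changing (hypothetical driver)."""
    medoids = list(medoids)
    for _ in range(max_iter):
        groups = defaultdict(list)
        # In a real MapReduce job each chunk would go to a separate mapper;
        # here the shuffle is simulated by grouping emitted pairs in memory.
        for chunk in chunks:
            for key, p in map_assign(chunk, medoids):
                groups[key].append(p)
        # An empty cluster keeps its previous medoid.
        new_medoids = [
            reduce_medoid(groups[i]) if groups[i] else medoids[i]
            for i in range(len(medoids))
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids


if __name__ == "__main__":
    chunks = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
    print(k_medoids(chunks, [(0, 1), (11, 10)]))
```

The map step parallelizes over input splits regardless of k, whereas the reduce step here does not; a design that removes that dependence would need a different decomposition of the medoid-update work.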
