Data Categorization Using Hadoop MapReduce-Based Parallel K-Means Clustering

Abstract The volume of datasets is increasing at a very fast rate due to the expanding digitalization of every field of work. Traditional clustering algorithms become ineffective on such huge datasets because they require a large amount of time to cluster them. Parallel and distributed architectures are designed to process such large datasets. To cluster efficiently, traditional clustering algorithms need to be redesigned for these parallel and distributed architectures. A few parallel clustering algorithms have been designed for efficiency, but they work on datasets that are loaded into and accessed from main memory, which limits their use on large datasets whose millions of data objects cannot be loaded into memory at once. In this work, we propose a parallel version of traditional K-means that executes on the Hadoop distributed framework. The experimental results show that our proposed K-means algorithm outperforms traditional K-means when clustering large volumes of data.
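
The abstract does not spell out how K-means is split across map and reduce tasks, so the following is only a minimal sketch of one common decomposition, not the authors' implementation. It assumes Euclidean distance, in-memory Python lists standing in for HDFS input splits, and hypothetical function names (`kmeans_map`, `kmeans_combine`, `kmeans_reduce`, `parallel_kmeans_sketch`). The map step assigns each point to its nearest centroid, a combiner-style step sums coordinates per cluster, and the reduce step averages those sums into new centroids.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_map(point, centroids):
    """Map step (assumed): emit (nearest centroid index, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def kmeans_combine(pairs):
    """Shuffle/combine step (assumed): per cluster, sum coordinates and counts."""
    sums = {}
    for key, (point, count) in pairs:
        if key not in sums:
            sums[key] = (list(point), count)
        else:
            acc, n = sums[key]
            sums[key] = ([a + p for a, p in zip(acc, point)], n + count)
    return sums


def kmeans_reduce(sums, centroids):
    """Reduce step (assumed): new centroid = coordinate sum / point count.

    A cluster that received no points keeps its previous centroid.
    """
    updated = list(centroids)
    for key, (acc, n) in sums.items():
        updated[key] = tuple(a / n for a in acc)
    return updated


def parallel_kmeans_sketch(points, centroids, max_iter=20, tol=1e-9):
    """Driver: repeat map -> combine -> reduce until centroids stop moving."""
    for _ in range(max_iter):
        pairs = [kmeans_map(p, centroids) for p in points]
        updated = kmeans_reduce(kmeans_combine(pairs), centroids)
        shift = max(dist(old, new) for old, new in zip(centroids, updated))
        centroids = updated
        if shift <= tol:
            break
    return centroids


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
result = parallel_kmeans_sketch(points, [(0, 0), (10, 10)])
```

On a real Hadoop cluster each mapper would read only its own input split and the current centroids would be distributed to all mappers between iterations; the sketch runs these steps sequentially in one process purely to show the data flow.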
