Clustering large datasets using K-means modified inter and intra clustering (KM-I2C) in Hadoop

Big data technologies have become popular for processing, storing, and managing massive volumes of data, and clustering such datasets has become a challenging problem in big data analytics. The K-means algorithm, which groups entities by similarity according to a distance measure, works well on small datasets, but existing clustering algorithms need scalable solutions to handle large ones. This study presents two approaches to clustering large datasets using MapReduce. The first approach, K-Means Hadoop MapReduce (KM-HMR), is a MapReduce implementation of standard K-means. The second approach, K-means modified inter and intra clustering (KM-I2C), improves cluster quality by producing clusters with minimum intra-cluster and maximum inter-cluster distances on large datasets. The results show that the proposed approaches significantly improve clustering efficiency in terms of execution time. Experiments comparing standard K-means with the proposed solutions show that the KM-I2C approach is both effective and efficient.
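The abstract does not give implementation details, so the following is only a minimal single-process sketch of how a K-means iteration is commonly split into a map step (assign each point to its nearest centroid) and a reduce step (average the points of each cluster), together with simple intra-cluster and inter-cluster distance measures. It is not the authors' KM-HMR or KM-I2C code; all function names (`map_phase`, `reduce_phase`, `kmeans`, `intra_cluster_distance`, `inter_cluster_distance`) and the specific distance definitions are assumptions made for illustration, and a real Hadoop job would distribute the map and reduce steps across nodes.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_phase(points, centroids):
    """Mapper sketch: emit (index of nearest centroid, point) pairs."""
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield idx, p


def reduce_phase(pairs, centroids):
    """Reducer sketch: group points by centroid index and average them."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centroids = list(centroids)  # empty clusters keep their old centroid
    for idx, members in groups.items():
        new_centroids[idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centroids


def kmeans(points, centroids, max_iter=20):
    """Repeat map + reduce until the centroids stop changing."""
    centroids = [tuple(float(x) for x in c) for c in centroids]
    for _ in range(max_iter):
        new_centroids = reduce_phase(map_phase(points, centroids), centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Final assignment against the returned centroids.
    clusters = defaultdict(list)
    for idx, p in map_phase(points, centroids):
        clusters[idx].append(p)
    return centroids, dict(clusters)


def intra_cluster_distance(centroids, clusters):
    """Assumed measure: mean distance from each point to its own centroid."""
    total = sum(dist(p, centroids[i]) for i, ms in clusters.items() for p in ms)
    count = sum(len(ms) for ms in clusters.values())
    return total / count


def inter_cluster_distance(centroids):
    """Assumed measure: smallest distance between any two centroids."""
    return min(
        dist(a, b) for i, a in enumerate(centroids) for b in centroids[i + 1:]
    )


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
intra = intra_cluster_distance(centroids, clusters)
inter = inter_cluster_distance(centroids)
```

Under these assumed measures, a good clustering has a small `intra` value (compact clusters) and a large `inter` value (well-separated clusters), which is the quality goal the abstract attributes to the second approach.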
