The BigKClustering approach for document clustering using Hadoop MapReduce

Clustering is an efficient data mining and machine learning method for gaining insight into which objects of a dataset can be grouped together. K-Means is one of the most commonly used clustering methods, owing to its high-quality results and low time cost. However, applying the K-Means algorithm to document clustering over large-scale collections can lead to unpredictable time costs, since the duration of a K-Means iteration tends to grow as the number of iterations grows. In this paper we first present some of the most promising alternatives for document clustering over such 'big data' (large-scale) collections. We then present our variation of an existing K-Means-based algorithm, known as BigKClustering (BKC), adapted so that it can be applied to document clustering. The proposed adaptation of BKC is implemented using Hadoop MapReduce to handle big (text) data collections efficiently, and is tested experimentally in a real cluster environment. The experiments show that it achieves acceptable clustering quality as well as significant execution time improvements compared to K-Means, making it a promising clustering approach for big document collections.
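To make the baseline concrete, the following is a minimal sketch of a single K-Means iteration over documents, expressed as a map phase and a reduce phase in plain Python. It illustrates generic K-Means only, not the BKC variation, whose details are not given in this abstract. The sparse term-weight dictionaries, the use of cosine similarity, and all function names are assumptions made for illustration; a real Hadoop job would distribute the two phases across mapper and reducer tasks.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def map_phase(docs, centroids):
    """Map: emit (index of most similar centroid, document vector)."""
    for doc in docs:
        best = max(range(len(centroids)),
                   key=lambda i: cosine(doc, centroids[i]))
        yield best, doc


def reduce_phase(pairs, old_centroids):
    """Reduce: replace each centroid by the mean of its assigned vectors."""
    groups = defaultdict(list)
    for key, doc in pairs:
        groups[key].append(doc)
    # Centroids that received no documents are kept unchanged (assumption).
    new_centroids = list(old_centroids)
    for key, members in groups.items():
        total = defaultdict(float)
        for doc in members:
            for term, weight in doc.items():
                total[term] += weight
        new_centroids[key] = {t: w / len(members) for t, w in total.items()}
    return new_centroids


def kmeans_iteration(docs, centroids):
    """One full iteration: assign documents, then recompute centroids."""
    return reduce_phase(map_phase(docs, centroids), centroids)
```

In this sketch every iteration rescans the whole collection, which is one reason iterative K-Means can become costly on large document sets.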
