Big Data Clustering Using MapReduce Framework: A Review

Clustering is an essential data analysis technique that extracts distribution patterns or groups of similar items from data. Because clustering plays a crucial role in many scientific applications, much research is devoted to developing new algorithms for big data clustering. Even so, clustering remains a challenge for big data, as the size and variety of real-world datasets are increasing rapidly.
