Density-based Algorithms for Big Data Clustering Using MapReduce Framework

Clustering extracts hidden patterns and groups of similar objects from data. As an unsupervised learning method, it is a crucial technique for big data analysis, where the vast majority of objects are unlabeled. Density-based algorithms have attracted research interest because they help reveal complex patterns in spatial datasets, which contain information about co-located objects. Big data clustering is challenging because the volume of data grows exponentially, and the MapReduce framework offers one way to meet this challenge. Accordingly, density-based algorithms in MapReduce have been investigated extensively over the past decade to address the problem of big data clustering. Despite the diversity of the proposed algorithms, the field lacks a structured review of the available algorithms and of their techniques for partitioning, local clustering, and merging. This study formalizes the problem of density-based clustering using MapReduce, proposes a taxonomy to categorize the proposed algorithms, and provides a systematic and comprehensive comparison of these algorithms according to their partitioning technique, type of local clustering, merging technique, and the exactness of their implementations. Finally, the study highlights outstanding challenges and opportunities for contributing to the field of density-based clustering using MapReduce.
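
To make the three stages named above concrete (partitioning, local clustering, merging), the following is a minimal single-process sketch in Python. It is not taken from any of the reviewed algorithms. It assumes 2-D points, vertical strips of a hypothetical width `width` with an `eps`-wide overlap, a brute-force neighbourhood search, and a union-find merge. All function names (`partition`, `local_step`, `merge`, `dbscan_sketch`) are illustrative, and the map and reduce phases are simulated with plain loops rather than a real MapReduce runtime.

```python
from collections import defaultdict
from math import dist, floor


def partition(points, eps, width):
    """Partitioning: each point is 'inner' to its own strip and is copied
    into the halo of an adjacent strip when it lies within eps of the border."""
    parts = defaultdict(lambda: {"inner": [], "halo": []})
    for idx, (x, _y) in enumerate(points):
        k = floor(x / width)
        parts[k]["inner"].append(idx)
        if x - k * width <= eps:          # close to the left border
            parts[k - 1]["halo"].append(idx)
        if (k + 1) * width - x <= eps:    # close to the right border
            parts[k + 1]["halo"].append(idx)
    return parts


def local_step(points, part, eps, min_pts):
    """Local clustering: for every inner point, list its eps-neighbours
    (the point itself included) and decide whether it is a core point."""
    members = part["inner"] + part["halo"]
    records = []
    for p in part["inner"]:
        nbrs = [q for q in members if dist(points[p], points[q]) <= eps]
        records.append((p, len(nbrs) >= min_pts, nbrs))
    return records


def merge(n, records):
    """Merging: union core points that are eps-neighbours, then attach
    border points to the cluster of one core neighbour. Noise gets -1."""
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    core = {p for p, is_core, _ in records if is_core}
    for p, is_core, nbrs in records:
        if is_core:
            for q in nbrs:
                if q in core:
                    parent[find(q)] = find(p)

    labels = [-1] * n
    for p, is_core, nbrs in records:
        if is_core:
            labels[p] = find(p)
        else:
            core_nbr = next((q for q in nbrs if q in core), None)
            if core_nbr is not None:
                labels[p] = find(core_nbr)
    return labels


def dbscan_sketch(points, eps, min_pts, width):
    # The halo only reaches adjacent strips, so strips must be at least eps wide.
    assert width >= eps
    parts = partition(points, eps, width)
    records = [r for part in parts.values()
               for r in local_step(points, part, eps, min_pts)]
    return merge(len(points), records)


# Usage: a chain of points that crosses the border between two strips.
points = [(0.0, 0.0), (0.4, 0.0), (0.8, 0.0), (1.2, 0.0), (1.6, 0.0), (5.0, 0.0)]
labels = dbscan_sketch(points, eps=0.5, min_pts=3, width=1.0)
```

Because every inner point sees its complete `eps`-neighbourhood through the halo, the core-point test in this sketch matches a single-machine run. The algorithms compared in the study differ in how they realize each of these three stages and in whether the merged result remains exact.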
