μDBSCAN: An Exact Scalable DBSCAN Algorithm for Big Data Exploiting Spatial Locality

DBSCAN is one of the most popular and effective clustering algorithms, capable of efficiently identifying arbitrarily shaped clusters and noise. However, its super-linear complexity makes it infeasible for applications that cluster Big Data. Neighbourhood queries take up a major portion of DBSCAN's computation time and are the bottleneck to its performance. We address this issue with μDBSCAN, our proposed micro-cluster based DBSCAN algorithm, which identifies core points even without performing neighbourhood queries and thereby reduces the run-time of the algorithm. It also significantly reduces the computation time per neighbourhood query while still producing exact DBSCAN clusters. Moreover, the micro-cluster based solution makes it scalable to high-dimensional data. We also propose μDBSCAN-D, a highly scalable distributed implementation of μDBSCAN that exploits a commodity cluster infrastructure. Experimental results on various standard datasets demonstrate substantial performance improvements for both proposed algorithms over their respective state-of-the-art solutions. μDBSCAN-D is an exact parallel solution for DBSCAN that processes massive amounts of data efficiently (1 billion data points in 41 minutes on a 32-node cluster) while producing a clustering identical to that of traditional DBSCAN.
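The abstract does not spell out how core points can be found without neighbourhood queries, so the following is only a minimal sketch of one geometric argument that makes this possible; it is not claimed to be the paper's exact construction. The assumption is that a micro-cluster has a centre and that we look at the points within ε/2 of it. By the triangle inequality any two such points are at most ε apart, so if at least MinPts points fall in that inner ball, each of them already has MinPts neighbours (counting itself) and is a core point. The function names, the ε/2 radius, and the sample data are all illustrative assumptions.

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def brute_force_core_points(points, eps, min_pts):
    """Classic DBSCAN core test: one full neighbourhood query per point."""
    core = set()
    for i, p in enumerate(points):
        # The eps-neighbourhood includes the point itself.
        neighbours = sum(1 for q in points if dist(p, q) <= eps)
        if neighbours >= min_pts:
            core.add(i)
    return core


def core_points_without_queries(points, eps, min_pts, center):
    """Hypothetical shortcut: collect the points within eps/2 of `center`.

    Any two of them are at most eps apart, so if there are at least
    min_pts of them, all are core points and no per-point neighbourhood
    query is needed. Otherwise nothing can be concluded and an empty
    set is returned.
    """
    inner = [i for i, p in enumerate(points) if dist(p, center) <= eps / 2]
    return set(inner) if len(inner) >= min_pts else set()


# Illustrative data: four points packed near the origin and one far away.
sample = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (5.0, 5.0)]
shortcut = core_points_without_queries(sample, eps=0.5, min_pts=4, center=(0.0, 0.0))
exact = brute_force_core_points(sample, eps=0.5, min_pts=4)
```

The shortcut costs one distance computation per point against a single centre, whereas the brute-force test compares every point with every other point. The shortcut can only confirm core points; points it leaves undecided would still need a regular query.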
