Theoretically-Efficient and Practical Parallel DBSCAN

The DBSCAN method for spatial clustering has received significant attention due to its applicability in a variety of data analysis tasks. There are fast sequential algorithms for DBSCAN in Euclidean space that take O(n log n) work in two dimensions and sub-quadratic work in three or more dimensions, and DBSCAN can be computed approximately in linear work for any constant number of dimensions. However, existing parallel DBSCAN algorithms require quadratic work in the worst case. This paper bridges the gap between the theory and practice of parallel DBSCAN by presenting new parallel algorithms for Euclidean exact DBSCAN and approximate DBSCAN that match the work bounds of their sequential counterparts and are highly parallel (polylogarithmic depth). We present implementations of our algorithms along with optimizations that improve their practical performance. We perform a comprehensive experimental evaluation of our algorithms on a variety of datasets and parameter settings. Our experiments on a 36-core machine with two-way hyper-threading show that our implementations outperform existing parallel implementations by up to several orders of magnitude and achieve speedups of up to 33x over the best sequential algorithms.
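To make the quadratic-work baseline mentioned above concrete, the following is a minimal sequential sketch of textbook DBSCAN with brute-force neighbor search. It is not the paper's algorithm. The function name `dbscan_bruteforce` and the parameter names `eps` (neighborhood radius) and `min_pts` (minimum neighborhood size for a core point) are assumptions chosen for illustration; the abstract does not name them.

```python
from collections import deque
from math import dist

NOISE = -1  # label for points that belong to no cluster


def dbscan_bruteforce(points, eps, min_pts):
    """Textbook DBSCAN with all-pairs neighbor search (illustrative only).

    The all-pairs distance computation below is the source of the
    quadratic work: every point is compared against every other point.
    """
    n = len(points)
    # neighbors[i] holds every j within distance eps of i (including i itself).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # A point is a core point if its eps-neighborhood has at least min_pts points.
    is_core = [len(nb) >= min_pts for nb in neighbors]

    labels = [NOISE] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != NOISE:
            continue
        # Grow a new cluster from an unlabeled core point via breadth-first search.
        labels[i] = cluster
        queue = deque([i])
        while queue:
            p = queue.popleft()  # p is always a core point
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster
                    # Only core points propagate the cluster further;
                    # border points are labeled but not expanded.
                    if is_core[q]:
                        queue.append(q)
        cluster += 1
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    print(dbscan_bruteforce(pts, eps=1.5, min_pts=3))
```

In this sketch a border point reachable from several clusters is assigned to whichever cluster reaches it first, which is one common convention. The sub-quadratic bounds quoted in the abstract come from avoiding the all-pairs comparison, which this baseline deliberately does not attempt.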
