Pardicle: Parallel Approximate Density-Based Clustering

DBSCAN is a widely used density-based clustering algorithm for particle data, well known for its ability to isolate arbitrarily shaped clusters and to filter out noise. The algorithm is super-linear (O(n log n)) and computationally expensive for large datasets. Given the need for speed, we propose a fast heuristic algorithm for DBSCAN using density-based sampling, which achieves quality comparable to exact algorithms but is more than an order of magnitude faster. Our experiments on massive astrophysics and synthetic datasets (8.5 billion numbers) show that our approximate algorithm is up to 56× faster than exact algorithms with almost identical quality (Omega-Index ≥ 0.99). We also develop a new parallel DBSCAN algorithm, which uses dynamic partitioning to improve load balancing and locality. We demonstrate near-linear speedup on shared-memory (15× using 16 cores, single-node Intel® Xeon® processor) and distributed-memory (3917× using 4096 cores, multinode) computers, with a 2× additional performance improvement using Intel® Xeon Phi™ coprocessors. Additionally, existing exact algorithms can achieve up to 3.4× speedup using dynamic partitioning.
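
For readers unfamiliar with the exact baseline that the abstract compares against, the following is a minimal sketch of sequential, exact DBSCAN. It is not the approximate Pardicle algorithm and not the authors' implementation. The function name `dbscan`, the parameter names `eps` and `min_pts`, the brute-force O(n^2) neighbor search, and the use of a disjoint-set (union-find) structure to merge core points are all assumptions made here for brevity; a practical implementation would use a spatial index such as a kd-tree.

```python
from math import dist  # Euclidean distance, Python 3.8+


def dbscan(points, eps, min_pts):
    """Illustrative exact DBSCAN; returns one label per point, -1 meaning noise."""
    n = len(points)
    # Brute-force eps-neighborhoods (each point counts as its own neighbor).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_pts for nb in neighbors]

    # Disjoint-set forest over point indices, with path halving in find().
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # Core points within eps of each other belong to the same cluster.
    for i in range(n):
        if core[i]:
            for j in neighbors[i]:
                if core[j]:
                    union(i, j)

    # Number the clusters in order of their first core point.
    labels = [-1] * n
    cluster_ids = {}
    for i in range(n):
        if core[i]:
            labels[i] = cluster_ids.setdefault(find(i), len(cluster_ids))

    # A border point joins the cluster of the first core neighbor found;
    # points with no core neighbor keep the noise label -1.
    for i in range(n):
        if not core[i]:
            for j in neighbors[i]:
                if core[j]:
                    labels[i] = labels[j]
                    break
    return labels
```

As a usage example, calling `dbscan` on two well-separated unit squares plus one distant point, with `eps=1.5` and `min_pts=3`, assigns each square its own cluster label and marks the distant point as noise.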
