Paper information - μDBSCAN: An Exact Scalable DBSCAN Algorithm for Big Data Exploiting Spatial Locality

DBSCAN is one of the most popular and effective clustering algorithms, capable of identifying arbitrarily shaped clusters and noise efficiently. However, its super-linear complexity makes it infeasible for applications that cluster Big Data. Neighbourhood queries take up a major portion of DBSCAN's computation time and become the bottleneck to its performance. We address this issue with our proposed micro-cluster based DBSCAN algorithm, μDBSCAN, which identifies core points even without performing neighbourhood queries and thereby reduces the run-time of the algorithm. It also significantly reduces the computation time per neighbourhood query while still producing exact DBSCAN clusters. Moreover, the micro-cluster based solution makes it scalable for high-dimensional data. We also propose a highly scalable distributed implementation of μDBSCAN, μDBSCAN-D, that exploits a commodity cluster infrastructure. Experimental results on various standard datasets show large performance improvements for both proposed algorithms over their respective state-of-the-art solutions. μDBSCAN-D is an exact parallel solution for DBSCAN that processes massive amounts of data efficiently (1 billion data points in 41 minutes on a 32-node cluster) while producing the same clustering as traditional DBSCAN.
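The abstract does not spell out how core points can be found without neighbourhood queries. One geometric fact that makes this possible is the triangle inequality: if at least MinPts points lie within ε/2 of a common centre, every pair of them is at most ε apart, so each of them already has at least MinPts points (counting itself) in its ε-neighbourhood. The Python sketch below illustrates only that fact. It is not the authors' implementation; the greedy grouping, the ε/2 radius, and all function names are assumptions made for illustration.

```python
import math


def build_micro_clusters(points, eps):
    """Greedily group points: each point joins the first centre within eps/2,
    otherwise it becomes the centre of a new group (hypothetical scheme)."""
    centres = []   # index of the centre point of each group
    members = []   # members[i] lists the point indices in group i
    for idx, p in enumerate(points):
        for g, centre_idx in enumerate(centres):
            if math.dist(p, points[centre_idx]) <= eps / 2:
                members[g].append(idx)
                break
        else:
            centres.append(idx)
            members.append([idx])
    return members


def core_points_from_micro_clusters(points, eps, min_pts):
    """Flag core points using group sizes only, with no per-point range query.
    Any group holding >= min_pts points consists entirely of core points."""
    core = set()
    for group in build_micro_clusters(points, eps):
        if len(group) >= min_pts:
            core.update(group)
    return core


def core_points_brute_force(points, eps, min_pts):
    """Reference definition: a point is core if its eps-neighbourhood,
    including the point itself, contains at least min_pts points."""
    core = set()
    for i, p in enumerate(points):
        count = sum(1 for q in points if math.dist(p, q) <= eps)
        if count >= min_pts:
            core.add(i)
    return core


# Five tightly packed points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.2, 0.2), (5.0, 5.0)]
fast = core_points_from_micro_clusters(points, eps=1.0, min_pts=4)
exact = core_points_brute_force(points, eps=1.0, min_pts=4)
```

In this sketch the shortcut can only under-report: every point it flags is a true core point, but points in sparse groups still need an ordinary neighbourhood query before they can be classified. That is why the final check should be `fast <= exact` rather than equality in general.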
