Grid-based DBSCAN: Indexing and inference

Abstract DBSCAN is one of clustering algorithms which can report arbitrarily-shaped clusters and noises without requiring the number of clusters as a parameter (unlike the other clustering algorithms, k-means, for example). Because the running time of DBSCAN has quadratic order of growth, i.e. O(n2), research studies on improving its performance have been received a considerable amount of attention for decades. Grid-based DBSCAN is a well-developed algorithm whose complexity is improved to O(nlog n) in 2D space, while requiring Ω(n4/3) to solve when dimension  ≥ 3. However, we find that Grid-based DBSCAN suffers from two problems: neighbour explosion and redundancies in merging, which make the algorithms infeasible in high dimensional space. In this paper we first propose a novel algorithm called GDCF which utilizes bitmap indexing to support efficient neighbour grid queries. Second, based on the concept of union-find algorithm we devise a forest-like structure, called cluster forest, to alleviate the redundancies in the merging. Moreover, we find that running the cluster forest in different orders can lead to a different number of merging operations needed to perform in the merging step. We propose to perform the merging step in a uniform random order to optimize the number of merging operations. However, for high-density database, a bottleneck could be occurred, we further propose a low-density-first order to alleviate this bottleneck. The experiments resulted on both real-world and synthetic datasets demonstrate that the proposed algorithm outperforms the state-of-the-art exact/approximate DBSCAN and suggests a good scalability.

[1]  Barton P. Miller,et al.  Mr. Scan: Extreme scale density-based clustering using a tree-based network of GPGPU nodes , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[2]  Pradeep Dubey,et al.  Pardicle: Parallel Approximate Density-Based Clustering , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[3]  Yike Guo,et al.  Fast density clustering strategies based on the k-means algorithm , 2017, Pattern Recognit..

[4]  Hans-Peter Kriegel,et al.  Parallel Density-Based Clustering of Complex Objects , 2006, PAKDD.

[5]  Derya Birant,et al.  ST-DBSCAN: An algorithm for clustering spatial-temporal data , 2007, Data Knowl. Eng..

[6]  Jon Louis Bentley,et al.  Multidimensional binary search trees used for associative searching , 1975, CACM.

[7]  Hans-Peter Kriegel,et al.  OPTICS: ordering points to identify the clustering structure , 1999, SIGMOD '99.

[8]  Huan Liu,et al.  '1+1>2': merging distance and density based clustering , 2001, Proceedings Seventh International Conference on Database Systems for Advanced Applications. DASFAA 2001.

[9]  Jiawei Han,et al.  CLARANS: A Method for Clustering Objects for Spatial Data Mining , 2002, IEEE Trans. Knowl. Data Eng..

[10]  Hans-Peter Kriegel,et al.  Density-Connected Subspace Clustering for High-Dimensional Data , 2004, SDM.

[11]  Ira Assent,et al.  Scalable and Interactive Graph Clustering Algorithm on Multicore CPUs , 2017, 2017 IEEE 33rd International Conference on Data Engineering (ICDE).

[12]  Hans-Peter Kriegel,et al.  Density-based clustering of uncertain data , 2005, KDD '05.

[13]  Younghoon Kim,et al.  DBCURE-MR: An efficient density-based clustering algorithm for large data using MapReduce , 2014, Inf. Syst..

[14]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[15]  Otfried Cheong,et al.  Euclidean minimum spanning trees and bichromatic closest pairs , 1990, SCG '90.

[16]  P. Viswanath,et al.  Rough-DBSCAN: A fast hybrid density based clustering method for large data sets , 2009, Pattern Recognit. Lett..

[17]  Ignazio Gallo,et al.  An online document clustering technique for short web contents , 2009, Pattern Recognit. Lett..

[18]  Bo Wang,et al.  Effectively clustering by finding density backbone based-on kNN , 2016, Pattern Recognit..

[19]  Jing Wang,et al.  A distribution density-based methodology for driving data cluster analysis: A case study for an extended-range electric city bus , 2018, Pattern Recognit..

[20]  Ira Assent,et al.  AnyDBC: An Efficient Anytime Density-based Clustering Algorithm for Very Large Complex Datasets , 2016, KDD.

[21]  Khaled Mahar,et al.  Using grid for accelerating density-based clustering , 2008, 2008 8th IEEE International Conference on Computer and Information Technology.

[22]  Hans-Peter Kriegel,et al.  The R*-tree: an efficient and robust access method for points and rectangles , 1990, SIGMOD '90.

[23]  Hans-Peter Kriegel,et al.  DBSCAN Revisited, Revisited , 2017, ACM Trans. Database Syst..

[24]  D.K. Bhattacharyya,et al.  An improved sampling-based DBSCAN for large spatial databases , 2004, International Conference on Intelligent Sensing and Information Processing, 2004. Proceedings of.

[25]  Cao Jing,et al.  Approaches for scaling DBSCAN algorithm to large spatial databases , 2000 .

[26]  Yingjie Tian,et al.  A Comprehensive Survey of Clustering Algorithms , 2015, Annals of Data Science.

[27]  A. Rama Mohan Reddy,et al.  A fast DBSCAN clustering algorithm by accelerating neighbor searching using Groups method , 2016, Pattern Recognit..

[28]  Chengqi Zhang,et al.  Clustering High-Dimensional Data with Low-Order Neighbors , 2004, IEEE/WIC/ACM International Conference on Web Intelligence (WI'04).

[29]  Hans-Peter Kriegel,et al.  A distribution-based clustering algorithm for mining in large spatial databases , 1998, Proceedings 14th International Conference on Data Engineering.

[30]  Duoqian Miao,et al.  A graph-theoretical clustering method based on two rounds of minimum spanning trees , 2010, Pattern Recognit..

[31]  Di Ma,et al.  MR-DBSCAN: An Efficient Parallel Density-Based Clustering Algorithm Using MapReduce , 2011, 2011 IEEE 17th International Conference on Parallel and Distributed Systems.

[32]  Howard J. Hamilton,et al.  DBRS: A Density-Based Spatial Clustering Method with Random Sampling , 2003, PAKDD.

[33]  Christian Böhm,et al.  Density-based clustering using graphics processors , 2009, CIKM.

[34]  Didier Stricker,et al.  Introducing a New Benchmarked Dataset for Activity Monitoring , 2012, 2012 16th International Symposium on Wearable Computers.

[35]  Keiichi Tamura,et al.  Cell-Based DBSCAN Algorithm Using Minimum Bounding Rectangle Criteria , 2017, DASFAA Workshops.

[36]  Kai Ming Ting,et al.  Density-ratio based clustering for discovering clusters with varying densities , 2016, Pattern Recognit..

[37]  Zhaohong Deng,et al.  Distance metric learning for soft subspace clustering in composite kernel space , 2016, Pattern Recognit..

[38]  Yufei Tao,et al.  DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation , 2015, SIGMOD Conference.

[39]  Stavros J. Perantonis,et al.  K-Nets: Clustering through nearest neighbors networks , 2019, Pattern Recognit..

[40]  David Horn,et al.  The Weight-Shape decomposition of density estimates: A framework for clustering and image analysis algorithms , 2018, Pattern Recognit..

[41]  Jiong Yang,et al.  STING: A Statistical Information Grid Approach to Spatial Data Mining , 1997, VLDB.

[42]  Alberto Del Bimbo,et al.  Motion segment decomposition of RGB-D sequences for human behavior understanding , 2017, Pattern Recognit..

[43]  Mohamed A. Ismail,et al.  A distance-relatedness dynamic model for clustering high dimensional data of arbitrary shapes and densities , 2009, Pattern Recognit..

[44]  Cheng Wang,et al.  A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data , 2018, Pattern Recognit..

[45]  Meng Wang,et al.  Image clustering based on sparse patch alignment framework , 2014, Pattern Recognit..

[46]  Yanchang Zhao,et al.  AGRID: An Efficient Algorithm for Clustering Large High-Dimensional Datasets , 2003, PAKDD.

[47]  J. A. Hartigan,et al.  A k-means clustering algorithm , 1979 .

[48]  Hans-Peter Kriegel,et al.  Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications , 1998, Data Mining and Knowledge Discovery.

[49]  Pasi Fränti,et al.  K-means properties on six clustering benchmark datasets , 2018, Applied Intelligence.