BLOCK-DBSCAN: Fast clustering for large scale data

Abstract We analyze the drawbacks of DBSCAN and its variants and find that the grid technique used in Fast-DBSCAN and ρ-approximate DBSCAN is almost useless in high-dimensional data spaces, because it usually yields considerable redundant distance computations. To address these problems, two techniques are proposed. The first uses an ϵ/2-norm ball to identify Inner Core Blocks, within which all points are core points; it is more efficient than the grid technique because it finds more core points at one time. The second is a fast approximate algorithm for judging whether two Inner Core Blocks are density-reachable from each other. In addition, a cover tree is used to accelerate density computations. Based on these three techniques, an approximate approach, BLOCK-DBSCAN, is proposed for large-scale data; it runs in about O(n log n) expected time and obtains almost the same result as DBSCAN. BLOCK-DBSCAN has two versions: the L2 version works well for relatively high-dimensional data, and the L∞ version is suitable for high-dimensional data. Experimental results show that BLOCK-DBSCAN is promising and outperforms NQDBSCAN, ρ-approximate DBSCAN, and AnyDBC.
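The Inner Core Block idea can be illustrated with a minimal sketch. By the triangle inequality, any two points inside a ball of radius ϵ/2 are at most ϵ apart, so if that ball holds at least MinPts points, every point in it is a core point. The code below is not the authors' implementation: it uses brute-force L2 distances instead of a cover tree, counts a point as its own neighbor, and the names `find_inner_core_block` and `is_core_point` are hypothetical.

```python
import numpy as np


def find_inner_core_block(points, center_idx, eps, min_pts):
    """Return indices inside the eps/2 ball around one point, or None.

    Assumption: brute-force L2 distances stand in for the cover tree
    range query mentioned in the abstract.
    """
    dists = np.linalg.norm(points - points[center_idx], axis=1)
    block = np.flatnonzero(dists <= eps / 2.0)
    # The ball qualifies as an Inner Core Block only if it is dense enough.
    return block if block.size >= min_pts else None


def is_core_point(points, idx, eps, min_pts):
    """Plain DBSCAN core test: at least min_pts points within eps (self included)."""
    dists = np.linalg.norm(points - points[idx], axis=1)
    return int(np.count_nonzero(dists <= eps)) >= min_pts


# Toy data: four tightly packed points and one distant outlier.
points = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [0.0, 0.1],
    [0.1, 0.1],
    [5.0, 5.0],
])
eps, min_pts = 0.4, 4

block = find_inner_core_block(points, 0, eps, min_pts)
# Every member of the block passes the ordinary core test without
# needing its own eps-range query.
all_core = all(is_core_point(points, i, eps, min_pts) for i in block)
outlier_block = find_inner_core_block(points, 4, eps, min_pts)
```

A single ϵ/2 query here labels four core points at once, which is the saving the abstract attributes to Inner Core Blocks; the outlier's ball is too sparse to form a block, so it would still need the ordinary ϵ-range test.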
