Comparative evaluation of region query strategies for DBSCAN clustering

Abstract: Clustering is a technique that organizes data into groups of similar objects. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm that relies on a density-based notion of cluster and is designed to discover clusters of arbitrary shape. The computational complexity of DBSCAN is dominated by the calculation of the ϵ-neighborhood of every object in the dataset. Thus, the efficiency of DBSCAN can be improved in two ways: (1) by reducing the overall number of ϵ-neighborhood queries (also known as region queries), or (2) by reducing the complexity of the nearest neighbor search conducted for each region query. This paper addresses the first approach by considering the most relevant region query strategies for DBSCAN, all of which inspect the neighborhoods of only a subset of the objects in the dataset. We comparatively evaluate these region query strategies (or DBSCAN variants) in terms of clustering effectiveness and efficiency, and we also introduce a novel region query strategy. The results show that some DBSCAN variants are only slightly inferior to DBSCAN in effectiveness while greatly improving its efficiency. Among these variants, the novel one outperforms the rest.
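
To make the cost model concrete, the following is a minimal Python sketch of the standard DBSCAN procedure that counts how many region queries it issues. It is an illustration only, not the implementation evaluated in the paper: the names `region_query` and `dbscan`, the brute-force Euclidean neighborhood search, and the toy dataset are all assumptions made here. The point it shows is that plain DBSCAN issues one region query per object, which is the count that the variants discussed in the abstract aim to reduce.

```python
import math

def region_query(points, i, eps):
    """Brute-force eps-neighborhood of points[i]; the result includes i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return (labels, number_of_region_queries); label -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    queries = 0
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        queries += 1
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: already queried, not expanded
                continue
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            queries += 1
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is a core point: expand the cluster
    return labels, queries

# Hypothetical toy data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels, queries = dbscan(points, eps=1.5, min_pts=3)
```

In this sketch every object is queried exactly once, so `queries` equals `len(points)`. A region query strategy of the kind described above would skip the call to `region_query` for some objects, trading a possible loss in clustering effectiveness for fewer queries.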
