RNN-DBSCAN: A Density-Based Clustering Algorithm Using Reverse Nearest Neighbor Density Estimates

A new density-based clustering algorithm, RNN-DBSCAN, is presented which uses reverse nearest neighbor counts as an estimate of observation density. Clustering is performed using a DBSCAN-like approach based on $k$-nearest-neighbor graph traversals through dense observations. RNN-DBSCAN is preferable to the popular density-based clustering algorithm DBSCAN in two respects. First, problem complexity is reduced to a single parameter (the choice of $k$ nearest neighbors); second, it has an improved ability to handle large variations in cluster density (heterogeneous density). The superiority of RNN-DBSCAN is demonstrated on several artificial and real-world datasets with respect to prior work on reverse nearest neighbor based clustering approaches (RECORD, IS-DBSCAN, and ISB-DBSCAN), along with DBSCAN and OPTICS. Each of these clustering approaches is described by a common graph-based interpretation wherein clusters of dense observations are defined as connected components, along with a discussion of their computational complexity. Heuristics for RNN-DBSCAN parameter selection are presented, and the effects of $k$ on RNN-DBSCAN clusterings are discussed. Additionally, with respect to scalability, an approximate version of RNN-DBSCAN is presented that leverages an existing approximate $k$-nearest-neighbor technique.
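
To make the two central ideas of the abstract concrete (reverse nearest neighbor counts as a density estimate, and clusters as connected components of dense observations in a $k$-nearest-neighbor graph), a minimal Python sketch follows. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, neighbors are found by brute force, the rule that an observation is dense when its reverse neighbor count is at least $k$ is assumed rather than taken from the abstract, and non-dense observations are simply left unlabeled (-1) instead of being attached to a nearby cluster or treated further.

```python
import numpy as np


def knn_indices(X, k):
    """Brute-force k nearest neighbors of each row of X (self excluded)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]


def reverse_nn_counts(nbrs, n):
    """For each point, count how many points list it among their k neighbors."""
    return np.bincount(nbrs.ravel(), minlength=n)


def cluster_dense(X, k):
    """Label connected components of dense points in the kNN graph.

    Assumption (not stated in the abstract): a point is "dense" when its
    reverse nearest neighbor count is at least k. Non-dense points keep
    the label -1 in this simplified sketch.
    """
    n = len(X)
    nbrs = knn_indices(X, k)
    counts = reverse_nn_counts(nbrs, n)
    dense = counts >= k

    # Undirected adjacency restricted to kNN edges between two dense points.
    adj = [[] for _ in range(n)]
    for i in range(n):
        if not dense[i]:
            continue
        for j in nbrs[i]:
            j = int(j)
            if dense[j]:
                adj[i].append(j)
                adj[j].append(i)

    # Depth-first traversal: each connected component becomes one cluster.
    labels = np.full(n, -1)
    cluster = 0
    for seed in range(n):
        if not dense[seed] or labels[seed] != -1:
            continue
        labels[seed] = cluster
        stack = [seed]
        while stack:
            i = stack.pop()
            for j in adj[i]:
                if labels[j] == -1:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    return labels, counts
```

Calling `cluster_dense(X, k)` on an `(n, d)` array with `k < n` returns one label per observation plus the reverse neighbor counts. Because every point contributes exactly $k$ outgoing neighbor links, the counts always sum to $nk$; points in sparse regions tend to receive fewer incoming links, which is why the count can serve as a density estimate with $k$ as the only parameter.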
