Accelerated Hierarchical Density Based Clustering

We present an accelerated algorithm for hierarchical density-based clustering. Our new algorithm improves upon HDBSCAN*, which itself provided a significant qualitative improvement over the popular DBSCAN algorithm. The accelerated HDBSCAN* algorithm offers performance comparable to DBSCAN while supporting variable-density clusters and eliminating the need for the difficult-to-tune distance scale parameter epsilon. This makes accelerated HDBSCAN* the default choice for density-based clustering.
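
To make the difficulty with epsilon concrete, the following is a minimal sketch of plain DBSCAN (not the accelerated HDBSCAN* algorithm) run on hypothetical one-dimensional toy data. The data set, the function names `dbscan_1d` and `count_clusters`, and the choice `min_samples=3` are assumptions made for illustration only. The data contain two dense groups that lie close together and one sparse group; under these assumptions no single epsilon recovers all three groups.

```python
def dbscan_1d(points, eps, min_samples):
    """Plain DBSCAN on 1-D points; returns one label per point, -1 for noise.

    A point counts itself among its neighbours (an assumed convention).
    """
    n = len(points)
    neighbors = [
        [j for j in range(n) if abs(points[i] - points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        # Only an unlabelled core point can seed a new cluster.
        if labels[i] != -1 or not is_core[i]:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels


def count_clusters(labels):
    """Number of distinct clusters, ignoring noise (-1)."""
    return len({label for label in labels if label != -1})


# Hypothetical toy data: two dense groups (spacing 1, separated by a gap of 6)
# and one sparse group (spacing 10).
dense_a = [0, 1, 2, 3, 4]
dense_b = [10, 11, 12, 13, 14]
sparse_c = [100, 110, 120, 130, 140]
points = dense_a + dense_b + sparse_c

for eps in (2, 10):
    labels = dbscan_1d(points, eps, min_samples=3)
    print(eps, count_clusters(labels), labels)
```

With a small epsilon the sparse group is labelled entirely as noise; with an epsilon large enough to connect the sparse group, the two dense groups merge into one. This is the variable-density situation that the abstract says HDBSCAN* handles without a distance scale parameter.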
