Efficient Computation and Visualization of Multiple Density-Based Clustering Hierarchies

HDBSCAN*, a state-of-the-art density-based hierarchical clustering method, produces a hierarchical organization of clusters in a dataset w.r.t. a parameter $mpts$. While a small change in $mpts$ typically leads to a small change in the clustering structure, choosing a “good” $mpts$ value can be challenging: depending on the data distribution, a high or low $mpts$ value may be more appropriate, and certain clusters may reveal themselves at different values.
To explore results for a range of $mpts$ values, one has to run HDBSCAN* for each value independently, which can be computationally impractical. In this paper, we propose an approach to efficiently compute all HDBSCAN* hierarchies for a range of $mpts$ values by building upon results from computational geometry to replace HDBSCAN*’s complete graph with a smaller equivalent graph. An experimental evaluation shows that our approach can obtain over one hundred hierarchies for the computational cost equivalent to running HDBSCAN* about twice, which corresponds to a speedup of more than 60 times compared to running HDBSCAN* independently that many times. We also propose a series of visualizations that allow users to analyze a collection of hierarchies for a range of $mpts$ values, along with case studies that illustrate how these analyses are performed.
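
To make the cost of the independent-runs baseline concrete, the following is a minimal sketch of its core step: for each $mpts$ value, build a minimum spanning tree over the complete mutual reachability graph. It is not the approach proposed in the paper, which replaces the complete graph with a smaller equivalent graph. The function names are hypothetical, and the sketch assumes Euclidean distance and the convention that a point counts as its own first nearest neighbor.

```python
import math


def core_distances(points, mpts):
    """Distance from each point to its mpts-th nearest neighbor.

    Assumption: the point itself counts as its first neighbor,
    so mpts = 1 gives a core distance of 0 for every point.
    """
    result = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        result.append(dists[mpts - 1])
    return result


def mutual_reachability_mst(points, mpts):
    """Prim's algorithm on the complete mutual reachability graph.

    Returns a sorted list of (weight, i, j) edges. Every pair of points
    is a candidate edge, so one call costs O(n^2) distance evaluations.
    """
    n = len(points)
    core = core_distances(points, mpts)

    def mreach(i, j):
        # Mutual reachability distance between points i and j.
        return max(core[i], core[j], math.dist(points[i], points[j]))

    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge into the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                w = mreach(u, v)
                if w < best[v]:
                    best[v] = w
                    parent[v] = u
    return sorted(edges)


def msts_for_range(points, mpts_values):
    """Baseline: one full MST computation per mpts value."""
    return {m: mutual_reachability_mst(points, m) for m in mpts_values}
```

Each call to `mutual_reachability_mst` repeats the quadratic scan over all pairs, so the total cost of this baseline grows linearly with the number of $mpts$ values explored. That repeated work is what a shared, smaller graph is meant to avoid.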
