A Generalized Single Linkage Method for Estimating the Cluster Tree of a Density

The goal of clustering is to detect the presence of distinct groups in a dataset and assign group labels to the observations. Nonparametric clustering is based on the premise that the observations may be regarded as a sample from some underlying density in feature space and that groups correspond to modes of this density. The goal then is to find the modes and assign each observation to the domain of attraction of a mode. The modal structure of a density is summarized by its cluster tree; modes of the density correspond to leaves of the cluster tree. Estimating the cluster tree is the primary goal of nonparametric cluster analysis. We adopt a plug-in approach to cluster tree estimation: estimate the cluster tree of the feature density by the cluster tree of a density estimate. For some density estimates the cluster tree can be computed exactly; for others we have to be content with an approximation. We present a graph-based method that can approximate the cluster tree of any density estimate. Density estimates tend to have spurious modes caused by sampling variability, leading to spurious branches in the graph cluster tree. We propose excess mass as a measure for the size of a branch, reflecting the height of the corresponding peak of the density above the surrounding valley floor as well as its spatial extent. Excess mass can be used as a guide for pruning the graph cluster tree. We point out mathematical and algorithmic connections to single linkage clustering and illustrate our approach on several examples. Supplemental materials for the article, including an R package implementing generalized single linkage clustering, all datasets used in the examples, and R code producing the figures and numerical results, are available online.
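The graph-based idea can be illustrated with a minimal sketch in one dimension. This is not the article's R package or its exact algorithm. It assumes a Gaussian kernel density estimate, a complete graph on the observations, and an edge weight equal to the minimum of the density estimate over a small grid on the segment joining two observations. The function names (`gaussian_kde`, `edge_weight`, `components_at_level`) and the parameters `bandwidth` and `steps` are hypothetical, chosen for illustration only. At a given level, observations whose estimated density reaches that level are kept, and two of them are joined when the edge between them also stays at or above the level. The resulting connected components approximate the branches of the cluster tree at that level.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function evaluating a 1-D Gaussian kernel density estimate."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


def edge_weight(density, a, b, steps=20):
    """Minimum of the density estimate over a grid on the segment [a, b].

    Assumption: a finite grid only approximates the true minimum.
    """
    return min(density(a + (b - a) * t / steps) for t in range(steps + 1))


def components_at_level(sample, density, level, steps=20):
    """Group observations into connected components of the level-set graph."""
    # Keep only observations whose estimated density reaches the level.
    kept = [i for i, x in enumerate(sample) if density(x) >= level]
    parent = {i: i for i in kept}

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Join two observations when the density stays at or above the level
    # along the whole (gridded) segment between them.
    for pos, i in enumerate(kept):
        for j in kept[pos + 1:]:
            if edge_weight(density, sample[i], sample[j], steps) >= level:
                parent[find(i)] = find(j)

    groups = {}
    for i in kept:
        groups.setdefault(find(i), []).append(sample[i])
    return sorted(groups.values())


if __name__ == "__main__":
    data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
    f_hat = gaussian_kde(data, bandwidth=0.3)
    for level in (0.0, 0.3):
        print(level, components_at_level(data, f_hat, level))
```

Sweeping `level` from low to high and recording where components split gives a rough picture of the tree. With two well-separated groups as in the toy data, a single component at a low level breaks into two at a higher one. A branch-size measure such as excess mass would then be needed to decide which splits are worth keeping; that step is not shown here.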
