Pruning nearest neighbor cluster trees

Nearest neighbor (k-NN) graphs are widely used in machine learning and data mining, and our aim is to better understand what they reveal about the cluster structure of the unknown underlying distribution of points. In particular, is it possible to identify spurious cluster structures that arise purely from sampling variability? Our first contribution is a statistical analysis showing that certain subgraphs of a k-NN graph form a consistent estimator of the cluster tree of the underlying distribution. Our second, and perhaps most important, contribution is a finite-sample guarantee: we carefully work out the tradeoff between aggressive and conservative pruning, and guarantee the removal of all spurious cluster structures at all levels of the tree while simultaneously guaranteeing the recovery of salient clusters. This is the first such finite-sample result in the context of clustering.
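
To make the kind of estimator described above concrete, the following is a minimal Python sketch, not the paper's algorithm: it thresholds points by their k-NN radius, connects the surviving points within distance r at each level, and prunes connected components below a size threshold as presumed sampling artifacts. The function name knn_cluster_tree and the parameters k, prune_size, and n_levels are illustrative assumptions, not taken from the paper; the paper's actual subgraph construction and pruning rule differ in their details.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def knn_cluster_tree(X, k=10, prune_size=5, n_levels=20):
    """Sketch of a k-NN-graph cluster tree estimator (illustrative only).

    At each level r: keep points whose k-NN radius is <= r, connect
    kept points within distance r, and report the connected components;
    components smaller than `prune_size` are pruned as presumed
    sampling artifacts.
    """
    tree = cKDTree(X)
    # k-NN radius of each point: distance to its k-th neighbor
    # (query with k+1 because the nearest "neighbor" is the point itself).
    dists, _ = tree.query(X, k=k + 1)
    r_k = dists[:, -1]

    levels = np.linspace(r_k.min(), r_k.max(), n_levels)
    cluster_tree = []
    for r in levels:
        keep = np.flatnonzero(r_k <= r)  # points dense enough at level r
        if len(keep) == 0:
            continue
        sub = cKDTree(X[keep])
        pairs = np.array(list(sub.query_pairs(r)))  # edges within distance r
        if len(pairs) == 0:
            comps = np.arange(len(keep))  # every kept point is isolated
        else:
            g = coo_matrix(
                (np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])),
                shape=(len(keep), len(keep)),
            )
            _, comps = connected_components(g, directed=False)
        # Prune components below the size threshold at this level.
        sizes = np.bincount(comps)
        big = [keep[comps == c] for c in np.flatnonzero(sizes >= prune_size)]
        cluster_tree.append((r, big))
    return cluster_tree

if __name__ == "__main__":
    # Toy usage: two well-separated Gaussian blobs should appear as two
    # clusters at intermediate levels and merge into one at high levels.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 0.3, (100, 2)),
                   rng.normal(2, 0.3, (100, 2))])
    for r, comps in knn_cluster_tree(X, k=8, prune_size=10):
        print(f"r={r:.2f}: {len(comps)} clusters")
```

The pruning threshold illustrates the tradeoff the abstract refers to: set it too low (conservative) and spurious small components survive; set it too high (aggressive) and genuine thin clusters are discarded along with the noise.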