Recovery Guarantees for Kernel-based Clustering under Non-parametric Mixture Models

Despite the ubiquity of kernel-based clustering, surprisingly few statistical guarantees exist beyond settings that impose strong structural assumptions on the data-generating process. In this work, we take a step towards bridging this gap by studying the statistical performance of kernel-based clustering algorithms under non-parametric mixture models. We provide necessary and sufficient separability conditions under which these algorithms can consistently recover the underlying true clustering. Our analysis yields guarantees for kernel clustering approaches without structural assumptions on the form of the component distributions. Additionally, we establish a key equivalence between kernel-based data clustering and kernel density-based clustering, which enables us to provide consistency guarantees for kernel-based estimators of non-parametric mixture models. Beyond its theoretical implications, this connection has practical consequences, including a principled approach to choosing the bandwidth of the Gaussian kernel in the context of clustering.
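To make the object of study concrete, the following is a minimal sketch of kernel k-means with a Gaussian kernel, the family of algorithms the abstract refers to. It operates only on the Gram matrix, using the standard feature-space distance identity; the implementation, the bandwidth value, and all function names are illustrative assumptions, not the paper's method.

```python
import numpy as np

def gaussian_kernel(X, bandwidth):
    # Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2)).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * bandwidth ** 2))

def kernel_kmeans(K, n_clusters, n_iter=100, seed=0):
    # Lloyd-style iterations in feature space, using only kernel evaluations:
    # dist(i, c) = K[i,i] - (2/|c|) sum_{j in c} K[i,j]
    #                     + (1/|c|^2) sum_{j,l in c} K[j,l].
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(n_clusters, size=n)
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            mask = labels == c
            nc = mask.sum()
            if nc == 0:
                continue  # empty cluster attracts no points
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, mask].sum(axis=1) / nc
                          + K[np.ix_(mask, mask)].sum() / nc ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # converged: assignments no longer change
        labels = new_labels
    return labels
```

On two well-separated point clouds, the feature-space distances drive the assignments to recover the groups; the bandwidth controls how quickly between-cluster similarities decay, which is exactly the tuning question the abstract's density-clustering connection speaks to.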
