Improved Coresets for Kernel Density Estimates

We study the construction of coresets for kernel density estimates. That is, we show how to approximate the kernel density estimate described by a large point set with another kernel density estimate built on a much smaller point set. For characteristic kernels (including the Gaussian and Laplace kernels), our approximation preserves the $L_\infty$ error between kernel density estimates within error $\epsilon$ using a coreset of size $2/\epsilon^2$; this size depends on no other aspect of the data (not the dimension, the diameter of the point set, or the bandwidth of the kernel), unlike other approximations. When the dimension is unrestricted, we show this bound is tight for these kernels as well as a much broader class. This work provides a careful analysis of the iterative Frank-Wolfe algorithm adapted to this context, an algorithm called \emph{kernel herding}; the analysis unites a broad line of work that spans statistics, machine learning, and geometry. When the dimension $d$ is constant, we demonstrate much tighter bounds on the size of the coreset specifically for Gaussian kernels, showing that it is bounded by the size of a coreset for axis-aligned rectangles. The best known constructive bound for that problem is $O(\frac{1}{\epsilon} \log^d \frac{1}{\epsilon})$, and non-constructively this can be improved by a factor of $\sqrt{\log \frac{1}{\epsilon}}$. This improves the best constant-dimension bounds polynomially for $d \geq 3$.
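The kernel herding procedure analyzed above can be sketched greedily: at each step, add the input point whose kernel mean best closes the gap between the full kernel density estimate and the one built from the points chosen so far. Below is a minimal illustrative sketch, assuming a Gaussian kernel and restricting candidates to the input point set; the function names and the fixed-bandwidth parameterization are our own, not the paper's.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    # Pairwise Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 h^2)).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def kernel_herding(P, m, bandwidth=1.0):
    """Greedily pick m points of P whose KDE tracks the KDE of all of P.

    A sketch of kernel herding (Frank-Wolfe on the kernel mean objective),
    with candidates restricted to the input points.
    """
    K = gaussian_kernel(P, P, bandwidth)   # n x n kernel matrix
    mu = K.mean(axis=1)                    # mu(p_i) = (1/n) sum_j k(p_i, p_j)
    chosen = []
    running = np.zeros(len(P))             # sum over chosen w_i of k(p, w_i)
    for t in range(m):
        # Next point: argmax_p  mu(p) - (1/(t+1)) * sum_{chosen} k(p, w_i)
        scores = mu - running / (t + 1)
        idx = int(np.argmax(scores))
        chosen.append(idx)
        running += K[:, idx]
    return P[chosen]
```

After $m$ rounds, the chosen points define a small KDE; the paper's analysis bounds how fast its $L_\infty$ distance to the full KDE shrinks with $m$.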
