The reproducing Stein kernel approach for post-hoc corrected sampling

Stein importance sampling is a widely applicable technique, based on the kernelized Stein discrepancy, that corrects the output of approximate sampling algorithms by reweighting the empirical distribution of the samples. We conduct a general analysis of this technique in the previously unconsidered setting where the samples are obtained by simulating a Markov chain, with results that apply on an arbitrary underlying Polish space. We prove that Stein importance sampling yields consistent estimators of quantities related to a target distribution of interest when the samples come from a geometrically ergodic Markov chain whose possibly unknown invariant measure differs from the desired target. The approach is shown to be valid under conditions satisfied by a large class of unadjusted samplers, and it retains consistency when data subsampling is used. Along the way, we establish a universal theory of reproducing Stein kernels, which enables the construction of kernelized Stein discrepancies on general Polish spaces and provides sufficient conditions for kernels to be convergence-determining on such spaces. These results are of independent interest for the development of future methodology based on kernelized Stein discrepancies.
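To make the reweighting step concrete, the sketch below illustrates the basic form of Stein importance sampling in the style of black-box importance sampling (Liu and Lee, 2016): given samples from an approximate sampler and the score function of the target, it builds the Stein kernel matrix and chooses simplex-constrained weights minimizing the kernelized Stein discrepancy of the reweighted empirical measure. This is a minimal illustration on R^d under assumptions not taken from the paper: a Langevin Stein operator, an IMQ base kernel, and the function names (`imq_stein_kernel`, `stein_importance_weights`), parameters (`c`, `beta`), and SLSQP solver choice are all illustrative.

```python
# Minimal sketch of Stein importance sampling on R^d.
# Assumptions (not from the paper): Langevin Stein operator, IMQ base
# kernel k(x, y) = (c^2 + ||x - y||^2)^beta, weights found by SLSQP.
import numpy as np
from scipy.optimize import minimize

def imq_stein_kernel(X, score, c=1.0, beta=-0.5):
    """Langevin Stein kernel matrix k_0(x_i, x_j) for an IMQ base kernel."""
    n, d = X.shape
    S = score(X)                          # (n, d): target score, grad log pi
    R = X[:, None, :] - X[None, :, :]     # (n, n, d): pairwise differences
    sq = np.sum(R ** 2, axis=-1)          # (n, n): squared distances
    base = (c ** 2 + sq) ** beta          # IMQ kernel values k(x, y)
    # grad_x k = 2*beta*(c^2 + ||r||^2)^(beta - 1) * r,  grad_y k = -grad_x k
    gk = 2.0 * beta * (c ** 2 + sq) ** (beta - 1)
    # div_x div_y k: trace of the mixed Hessian of the IMQ kernel
    trace = (-4.0 * beta * (beta - 1.0) * (c ** 2 + sq) ** (beta - 2) * sq
             - 2.0 * beta * d * (c ** 2 + sq) ** (beta - 1))
    term_x = gk * np.einsum('ijk,jk->ij', R, S)     # grad_x k . s_p(y)
    term_y = -gk * np.einsum('ijk,ik->ij', R, S)    # grad_y k . s_p(x)
    cross = base * (S @ S.T)                        # k(x, y) s_p(x).s_p(y)
    return trace + term_x + term_y + cross

def stein_importance_weights(K0):
    """Minimize w^T K0 w (the squared KSD of the reweighted empirical
    measure) over the probability simplex."""
    n = K0.shape[0]
    w0 = np.full(n, 1.0 / n)
    cons = {'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0}
    res = minimize(lambda w: w @ K0 @ w, w0, jac=lambda w: 2.0 * K0 @ w,
                   bounds=[(0.0, None)] * n, constraints=cons,
                   method='SLSQP')
    return res.x

# Illustrative usage: correct biased samples toward a standard Gaussian
# target, whose score is grad log pi(x) = -x.
rng = np.random.default_rng(0)
X = rng.normal(loc=0.5, scale=1.2, size=(100, 2))   # biased "sampler" output
w = stein_importance_weights(imq_stein_kernel(X, score=lambda X: -X))
print(w @ X)   # reweighted mean estimate, pulled toward the target mean (0, 0)
```

The simplex constraint mirrors the reweighting described in the abstract: the corrected estimator of E_pi[f] is the weighted average sum_i w_i f(x_i), so the output of the approximate sampler is corrected post hoc without re-running the chain.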
