Statistical Inference for Generative Models with Maximum Mean Discrepancy

While likelihood-based inference and its variants provide a statistically efficient and widely applicable approach to parametric inference, their application to models involving intractable likelihoods poses challenges. In this work, we study a class of minimum distance estimators for intractable generative models, that is, statistical models for which the likelihood is intractable but simulation is cheap. The distance considered, maximum mean discrepancy (MMD), is defined through the embedding of probability measures into a reproducing kernel Hilbert space. We study the theoretical properties of these estimators, showing that they are consistent, asymptotically normal, and robust to model misspecification. A main advantage of these estimators is the flexibility offered by the choice of kernel, which can be used to trade off statistical efficiency and robustness. On the algorithmic side, we study the geometry induced by MMD on the parameter space and use this to introduce a novel natural gradient descent-like algorithm for efficient implementation of these estimators. We illustrate the relevance of our theoretical results on several classes of models including a discrete-time latent Markov process and two multivariate stochastic differential equation models.
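To make the estimator concrete: for a kernel $k$, the squared MMD between distributions $P$ and $Q$ is $\mathrm{MMD}^2(P,Q) = \mathbb{E}[k(X,X')] - 2\,\mathbb{E}[k(X,Y)] + \mathbb{E}[k(Y,Y')]$ with $X,X' \sim P$ and $Y,Y' \sim Q$ independent, and a minimum MMD estimator picks the parameter whose simulated output is closest to the data in this distance. The following is a minimal NumPy sketch of this idea, not the paper's implementation: it uses the standard unbiased U-statistic estimate of $\mathrm{MMD}^2$ with a Gaussian kernel, and, for illustration only, replaces the paper's natural gradient descent with a grid search over a one-dimensional toy Gaussian location model (the model, lengthscale, and sample sizes are assumptions made for the example).

```python
import numpy as np

def gaussian_kernel(x, y, lengthscale=1.0):
    """Gaussian (RBF) kernel matrix between sample arrays of shape (n, d) and (m, d)."""
    sq_dists = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))

def mmd_squared(x, y, lengthscale=1.0):
    """Unbiased U-statistic estimate of MMD^2 between samples x ~ P and y ~ Q."""
    kxx = gaussian_kernel(x, x, lengthscale)
    kyy = gaussian_kernel(y, y, lengthscale)
    kxy = gaussian_kernel(x, y, lengthscale)
    n, m = len(x), len(y)
    # Drop diagonal terms so the within-sample averages are unbiased.
    term_xx = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_yy = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * kxy.mean()

# Toy minimum MMD estimation: the generative model is N(theta, 1); we recover
# theta by simulating from the model and minimising the estimated MMD^2.
rng = np.random.default_rng(0)
observed = rng.normal(loc=2.0, size=(200, 1))            # "data" with true theta = 2
grid = np.linspace(-1.0, 5.0, 61)
losses = [mmd_squared(rng.normal(loc=t, size=(200, 1)), observed) for t in grid]
theta_hat = grid[int(np.argmin(losses))]
print(f"minimum MMD estimate: {theta_hat:.2f}")          # close to 2.0
```

In practice the loss would be minimised with stochastic gradients through the simulator (the paper's natural gradient exploits the geometry MMD induces on the parameter space); grid search is used here only to keep the sketch self-contained.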
