Monte Carlo Gradient Estimation in Machine Learning

This paper is a broad and accessible survey of the methods available for Monte Carlo gradient estimation in machine learning and across the statistical sciences. The problem is to compute the gradient of an expectation of a function with respect to the parameters of the distribution being integrated over, also known as the problem of sensitivity analysis. In machine learning research, this gradient problem lies at the core of many learning problems in supervised, unsupervised and reinforcement learning. We generally seek to rewrite such gradients in a form that allows for Monte Carlo estimation, so that they can be easily and efficiently used and analysed. We explore three strategies (the pathwise, score function, and measure-valued gradient estimators), covering their historical development, derivation, and underlying assumptions. We describe their use in other fields, show how they are related and can be combined, and expand on their possible generalisations. Wherever Monte Carlo gradient estimators have been derived and deployed in the past, important advances have followed. A deeper and more widely held understanding of this problem will lead to further advances, and it is these advances that we wish to support.
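
To make the gradient problem concrete, the following is a minimal sketch of two of the three strategies named above, the score function and pathwise estimators. It is not taken from the paper. It assumes a toy setting chosen purely for illustration: a Gaussian distribution N(mu, sigma^2), the cost f(x) = x^2, and differentiation with respect to the mean mu, for which the exact gradient of the expectation is 2 * mu. The function names and all numerical settings are hypothetical.

```python
import numpy as np


def score_function_grad(f, mu, sigma, n, rng):
    """Score function estimate of d/dmu E_{x ~ N(mu, sigma^2)}[f(x)]."""
    x = rng.normal(mu, sigma, size=n)
    # Derivative of log N(x; mu, sigma^2) with respect to mu.
    score = (x - mu) / sigma**2
    return np.mean(f(x) * score)


def pathwise_grad(df, mu, sigma, n, rng):
    """Pathwise estimate of the same gradient; needs the derivative df of f."""
    eps = rng.normal(0.0, 1.0, size=n)
    # Express the sample as a deterministic path x = mu + sigma * eps.
    x = mu + sigma * eps
    # Chain rule: df/dx * dx/dmu, where dx/dmu = 1 for this path.
    return np.mean(df(x))


# Illustrative settings (assumptions, not values from the paper).
rng = np.random.default_rng(0)
mu, sigma, n = 1.5, 1.0, 200_000


def f(x):
    return x**2


def df(x):
    return 2.0 * x


sf = score_function_grad(f, mu, sigma, n, rng)
pw = pathwise_grad(df, mu, sigma, n, rng)
exact = 2.0 * mu  # d/dmu (mu^2 + sigma^2)

print(f"score function: {sf:.3f}  pathwise: {pw:.3f}  exact: {exact:.3f}")
```

In this toy case both estimates should approach the exact value as the number of samples grows. The score function version only evaluates f, while the pathwise version requires its derivative and a differentiable sampling path.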
