Backpropagation through the Void: Optimizing control variates for black-box gradient estimation

Gradient-based optimization is the foundation of deep learning and reinforcement learning. Even when the mechanism being optimized is unknown or not differentiable, optimization using high-variance or biased gradient estimates is still often the best strategy. We introduce a general framework for learning low-variance, unbiased gradient estimators for black-box functions of random variables. Our method uses gradients of a neural network trained jointly with model parameters or policies, and is applicable in both discrete and continuous settings. We demonstrate this framework for training discrete latent-variable models. We also give an unbiased, action-conditional extension of the advantage actor-critic reinforcement learning algorithm.
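To make the framework concrete, below is a minimal sketch of its continuous-variable (LAX-style) form: a single-sample estimate g = (f(b) - c_phi(b)) * d log p(b|theta)/d theta + d c_phi(b)/d theta, where the control variate c_phi is a small neural network trained jointly with theta to minimize the estimator's variance. This is an illustrative toy, not the paper's code: PyTorch, a quadratic stand-in for the black-box f, and a Gaussian sampling distribution are all assumptions, and the names `f_blackbox` and `surrogate` are invented for the example. The estimator stays unbiased for any c_phi because the control-variate terms have zero expectation.

```python
import torch

def f_blackbox(b):
    # Black-box objective: gradients never flow through it.
    with torch.no_grad():
        return (b - 0.5) ** 2

theta = torch.tensor(0.0, requires_grad=True)   # mean of the sampling distribution
sigma = 1.0                                     # fixed std, kept out of the optimization for simplicity
surrogate = torch.nn.Sequential(                # c_phi: the learned control variate
    torch.nn.Linear(1, 16), torch.nn.ELU(), torch.nn.Linear(16, 1))

opt_theta = torch.optim.Adam([theta], lr=1e-2)
opt_phi = torch.optim.Adam(surrogate.parameters(), lr=1e-2)

for step in range(2000):
    eps = torch.randn(())
    b = theta + sigma * eps                     # reparameterized sample, b ~ N(theta, sigma^2)

    # Score function d log p(b|theta) / d theta, with the sample b held fixed.
    logp = -0.5 * ((b.detach() - theta) / sigma) ** 2
    dlogp = torch.autograd.grad(logp, theta, create_graph=True)[0]

    # Control variate at the sample; depends on theta through the reparameterized b.
    c_b = surrogate(b.unsqueeze(0)).squeeze()
    dc = torch.autograd.grad(c_b, theta, create_graph=True)[0]

    # Single-sample estimate of d E[f(b)] / d theta:
    #   g = (f(b) - c(b)) * d log p / d theta + d c / d theta
    g_hat = (f_blackbox(b) - c_b) * dlogp + dc

    # Train phi to shrink Var(g_hat): E[g_hat] does not depend on phi,
    # so minimizing E[g_hat^2] minimizes the variance.
    phi_grads = torch.autograd.grad(g_hat ** 2, surrogate.parameters())

    opt_theta.zero_grad()
    theta.grad = g_hat.detach()                 # apply the gradient estimate to theta
    opt_theta.step()

    opt_phi.zero_grad()
    for p, g in zip(surrogate.parameters(), phi_grads):
        p.grad = g.detach()
    opt_phi.step()
```

If c_phi were fixed at zero this reduces to plain REINFORCE; the joint update of phi is what drives the variance down over training. The discrete case in the paper additionally evaluates c_phi at relaxed (e.g., Gumbel-Softmax) samples so that its reparameterization gradient remains defined.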
