Backpropagation through the Void: Optimizing control variates for black-box gradient estimation

Gradient-based optimization is the foundation of deep learning and reinforcement learning. Even when the mechanism being optimized is unknown or not differentiable, optimization using high-variance or biased gradient estimates is still often the best strategy. We introduce a general framework for learning low-variance, unbiased gradient estimators for black-box functions of random variables. Our method uses gradients of a neural network trained jointly with model parameters or policies, and is applicable in both discrete and continuous settings. We demonstrate this framework for training discrete latent-variable models. We also give an unbiased, action-conditional extension of the advantage actor-critic reinforcement learning algorithm.
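To make the framework concrete, below is a minimal sketch of its continuous-variable (LAX-style) form: a single-sample estimate g = (f(b) - c_phi(b)) * d log p(b|theta)/d theta + d c_phi(b)/d theta, where the control variate c_phi is a small neural network trained jointly with theta to minimize the estimator's variance. This is an illustrative toy, not the paper's code: PyTorch, a quadratic stand-in for the black-box f, and a Gaussian sampling distribution are all assumptions, and the names `f_blackbox` and `surrogate` are invented for the example. The estimator stays unbiased for any c_phi because the control-variate terms have zero expectation.

```python
import torch

def f_blackbox(b):
    # Black-box objective: gradients never flow through it.
    with torch.no_grad():
        return (b - 0.5) ** 2

theta = torch.tensor(0.0, requires_grad=True)   # mean of the sampling distribution
sigma = 1.0                                     # fixed std, kept out of the optimization for simplicity
surrogate = torch.nn.Sequential(                # c_phi: the learned control variate
    torch.nn.Linear(1, 16), torch.nn.ELU(), torch.nn.Linear(16, 1))

opt_theta = torch.optim.Adam([theta], lr=1e-2)
opt_phi = torch.optim.Adam(surrogate.parameters(), lr=1e-2)

for step in range(2000):
    eps = torch.randn(())
    b = theta + sigma * eps                     # reparameterized sample, b ~ N(theta, sigma^2)

    # Score function d log p(b|theta) / d theta, with the sample b held fixed.
    logp = -0.5 * ((b.detach() - theta) / sigma) ** 2
    dlogp = torch.autograd.grad(logp, theta, create_graph=True)[0]

    # Control variate at the sample; depends on theta through the reparameterized b.
    c_b = surrogate(b.unsqueeze(0)).squeeze()
    dc = torch.autograd.grad(c_b, theta, create_graph=True)[0]

    # Single-sample estimate of d E[f(b)] / d theta:
    #   g = (f(b) - c(b)) * d log p / d theta + d c / d theta
    g_hat = (f_blackbox(b) - c_b) * dlogp + dc

    # Train phi to shrink Var(g_hat): E[g_hat] does not depend on phi,
    # so minimizing E[g_hat^2] minimizes the variance.
    phi_grads = torch.autograd.grad(g_hat ** 2, surrogate.parameters())

    opt_theta.zero_grad()
    theta.grad = g_hat.detach()                 # apply the gradient estimate to theta
    opt_theta.step()

    opt_phi.zero_grad()
    for p, g in zip(surrogate.parameters(), phi_grads):
        p.grad = g.detach()
    opt_phi.step()
```

If c_phi were fixed at zero this reduces to plain REINFORCE; the joint update of phi is what drives the variance down over training. The discrete case in the paper additionally evaluates c_phi at relaxed (e.g., Gumbel-Softmax) samples so that its reparameterization gradient remains defined.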
