Action-dependent Control Variates for Policy Optimization via Stein's Identity

Policy gradient methods have achieved remarkable successes in solving challenging reinforcement learning problems. However, they still often suffer from high variance in the policy gradient estimates, which leads to poor sample efficiency during training. In this work, we propose a control variate method that effectively reduces this variance for policy gradient methods. Motivated by Stein's identity, our method extends the control variates previously used in REINFORCE and advantage actor-critic by introducing more general, action-dependent baseline functions. Empirical studies show that our method significantly improves the sample efficiency of state-of-the-art policy gradient approaches.
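
For context, a minimal sketch of the identity the abstract invokes; the notation used here (policy \pi(a \mid s), baseline \phi(s,a)) is an assumption for illustration rather than a quotation from the paper. For any sufficiently smooth \phi such that \pi(a \mid s)\,\phi(s,a) vanishes on the boundary of the action space, Stein's identity states

\[ \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\, \nabla_a \log \pi(a \mid s)\, \phi(s,a) + \nabla_a \phi(s,a) \,\big] = 0 . \]

Loosely speaking, the identity lets the score-function term involving \phi(s,a) be rewritten through \nabla_a \phi(s,a), so an action-dependent \phi can be subtracted as a baseline and the difference added back in a lower-variance form without biasing the gradient; a baseline depending only on the state s is recovered as the special case \phi(s,a) = V(s).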
