Improving Stochastic Policy Gradients in Continuous Control with Deep Reinforcement Learning using the Beta Distribution

Recently, reinforcement learning with deep neural networks has achieved great success in challenging continuous control problems such as 3D locomotion and robotic manipulation. However, in real-world control problems, the actions one can take are bounded by physical constraints, which introduces a bias into the policy gradient when the standard Gaussian distribution is used as the stochastic policy and out-of-bound actions are clipped. In this work, we propose the Beta distribution as an alternative and analyze the bias and variance of the policy gradients under both policies. We show that the Beta policy is bias-free and yields significantly faster convergence and higher scores than the Gaussian policy when both are used with trust region policy optimization (TRPO) and actor-critic with experience replay (ACER), the state-of-the-art on- and off-policy stochastic methods, respectively, on OpenAI Gym's and MuJoCo's continuous control environments.
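To make the idea concrete, below is a minimal sketch (not the paper's implementation) of how a Beta policy head might be parameterized for a bounded action space. The class name BetaPolicy, the softplus-plus-one parameterization, and the network sizes are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
from torch.distributions import Beta

class BetaPolicy(nn.Module):
    """Hypothetical Beta policy head: maps a state to Beta(alpha, beta)
    shape parameters and rescales samples from (0, 1) to [low, high]."""

    def __init__(self, state_dim, action_dim, low, high, hidden=64):
        super().__init__()
        self.low, self.high = low, high
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        # Separate heads for the two Beta shape parameters.
        self.alpha_head = nn.Linear(hidden, action_dim)
        self.beta_head = nn.Linear(hidden, action_dim)

    def distribution(self, state):
        h = self.body(state)
        # softplus(.) + 1 keeps alpha, beta > 1, so the density is unimodal
        # and all probability mass stays inside the action bounds
        # (a common choice; an assumption in this sketch).
        alpha = torch.nn.functional.softplus(self.alpha_head(h)) + 1.0
        beta = torch.nn.functional.softplus(self.beta_head(h)) + 1.0
        return Beta(alpha, beta)

    def act(self, state):
        dist = self.distribution(state)
        x = dist.sample()                                  # x in (0, 1)
        action = self.low + (self.high - self.low) * x     # rescale to bounds
        log_prob = dist.log_prob(x).sum(-1)                # weights the policy-gradient update
        return action, log_prob
```

In a policy-gradient method such as TRPO or ACER, the returned log-probability would weight the advantage estimate. Because every sampled action already lies inside the bounds, no clipping is needed, which is the source of the bias a Gaussian policy incurs on a bounded action space.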
