An Improved Trust-Region Method for Off-Policy Deep Reinforcement Learning

Reinforcement learning (RL) is a powerful tool for training agents to interact with complex environments. In particular, trust-region methods are widely used for policy optimization in model-free RL. However, these methods suffer from high sample complexity due to their on-policy nature, which requires fresh interactions with the environment for every policy update. To address this issue, off-policy trust-region methods have been proposed, but they have shown limited success on high-dimensional continuous control problems compared to other off-policy deep reinforcement learning (DRL) methods. To improve the performance and sample efficiency of trust-region policy optimization, we propose an off-policy trust-region RL algorithm. The algorithm builds on a theoretical result giving a closed-form solution to the trust-region policy optimization problem and is effective for optimizing complex nonlinear policies. We demonstrate that our algorithm outperforms prior trust-region DRL methods and achieves performance on a range of continuous control tasks in the Multi-Joint dynamics with Contact (MuJoCo) environment that is comparable to state-of-the-art off-policy algorithms.
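For background, the on-policy trust-region update that such methods extend is the KL-constrained surrogate objective of Trust Region Policy Optimization (Schulman et al., 2015). The formulation below states only that standard objective, with the usual symbols (advantage A, discounted state distribution rho, trust-region radius delta); it is not the closed-form off-policy update proposed in this paper.

\[
\max_{\theta} \;\; \mathbb{E}_{s \sim \rho_{\pi_{\theta_{\mathrm{old}}}},\, a \sim \pi_{\theta_{\mathrm{old}}}}
\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)} \, A^{\pi_{\theta_{\mathrm{old}}}}(s, a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s \sim \rho_{\pi_{\theta_{\mathrm{old}}}}}
\left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta
\]

Because the expectations are taken under the current policy's own state-action distribution, each update requires freshly collected on-policy data, which is the sample-complexity bottleneck that off-policy trust-region methods aim to remove.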
