Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Model-free deep reinforcement learning (RL) algorithms have been demonstrated on a range of challenging decision-making and control tasks. However, these methods typically suffer from two major challenges: very high sample complexity and brittle convergence properties, which necessitate meticulous hyperparameter tuning. Both of these challenges severely limit the applicability of such methods to complex, real-world domains. In this paper, we propose soft actor-critic, an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy; that is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. By combining off-policy updates with a stable stochastic actor-critic formulation, our method achieves state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving very similar performance across different random seeds.
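For reference, the maximum entropy objective described above is commonly written as follows; this is a sketch in standard notation rather than a verbatim excerpt, and the temperature parameter \alpha (trading off reward against entropy) follows the usual formulation of the framework:

    J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big]

Setting \alpha = 0 recovers the conventional expected-return objective, while larger values of \alpha favor policies that remain as random as possible while still solving the task.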
