A Self-Tuning Actor-Critic Algorithm

Reinforcement learning algorithms are highly sensitive to the choice of hyperparameters, typically requiring significant manual effort to identify hyperparameters that perform well on a new domain. In this paper, we take a step towards addressing this issue by using metagradients (Xu et al., 2018) to automatically adapt hyperparameters online. We apply our algorithm, Self-Tuning Actor-Critic (STAC), to self-tune all the differentiable hyperparameters of an actor-critic loss function, to discover auxiliary tasks, and to improve off-policy learning using a novel leaky V-trace operator. STAC is simple to use, sample efficient, and does not require a significant increase in compute. Ablative studies show that the overall performance of STAC improves as more hyperparameters are self-tuned. When applied to the Arcade Learning Environment (Bellemare et al., 2012), STAC improved the median human-normalized score in $200$M steps from $243\%$ to $364\%$. When applied to the DM Control suite (Tassa et al., 2018), STAC improved the mean score in $30$M steps from $217$ to $389$ when learning with features, from $108$ to $202$ when learning from pixels, and from $195$ to $295$ in the Real-World Reinforcement Learning Challenge (Dulac-Arnold et al., 2020).
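The self-tuning mechanism lends itself to a compact illustration. Below is a minimal sketch, in JAX, of the meta-gradient idea of Xu et al. (2018) that STAC builds on: an inner update of the agent parameters depends differentiably on a meta-parameter (here a discount factor), and the gradient of an outer loss, evaluated after that update under fixed reference hyperparameters, is backpropagated into the meta-parameter. All function and variable names are illustrative assumptions, not the paper's code.

```python
import jax
import jax.numpy as jnp

# Hypothetical setup: linear value function, with the discount
# gamma = sigmoid(eta) treated as a differentiable meta-parameter.
def inner_loss(theta, eta, batch):
    gamma = jax.nn.sigmoid(eta)              # keep gamma in (0, 1)
    v = batch["obs"] @ theta                 # V(x_t)
    v_next = batch["next_obs"] @ theta       # V(x_{t+1})
    target = batch["reward"] + gamma * v_next
    return jnp.mean((jax.lax.stop_gradient(target) - v) ** 2)

def outer_loss(theta, batch, gamma_ref=0.995):
    # Meta objective: the same TD error, but under a fixed reference discount.
    v = batch["obs"] @ theta
    target = batch["reward"] + gamma_ref * (batch["next_obs"] @ theta)
    return jnp.mean((jax.lax.stop_gradient(target) - v) ** 2)

def meta_step(theta, eta, batch, lr=1e-2, meta_lr=1e-3):
    def outer_after_inner(eta):
        # One inner SGD step; its dependence on eta stays in the graph,
        # so d(outer)/d(eta) flows through the updated parameters.
        theta_new = theta - lr * jax.grad(inner_loss)(theta, eta, batch)
        return outer_loss(theta_new, batch)

    meta_grad = jax.grad(outer_after_inner)(eta)
    theta = theta - lr * jax.grad(inner_loss)(theta, eta, batch)
    eta = eta - meta_lr * meta_grad
    return theta, eta
```

The leaky V-trace operator can be sketched in the same spirit. Where V-trace ([17]) clips the importance-sampling ratios at fixed thresholds, the leaky variant, by analogy with the leaky ReLU of [3, 4], lets a fraction of the unclipped ratio pass through the clip. The sketch below uses a single mixing coefficient `alpha` for both the rho and c weights for brevity; the function name and signature are assumptions, not the paper's API.

```python
import jax.numpy as jnp

def leaky_vtrace_targets(values, rewards, discounts, log_rhos,
                         alpha=0.9, rho_bar=1.0, c_bar=1.0):
    """Leaky V-trace value targets (illustrative sketch).

    values:    V(x_0..x_T), shape [T + 1]
    rewards:   r_0..r_{T-1}, shape [T]
    discounts: gamma_0..gamma_{T-1}, shape [T]
    log_rhos:  log(pi/mu) per step, shape [T]
    alpha:     leaky mixing coefficient.
    """
    is_ratios = jnp.exp(log_rhos)
    # "Leaky" clipping: let a fraction (1 - alpha) of the unclipped
    # importance ratio leak through the clip.
    rhos = alpha * jnp.minimum(rho_bar, is_ratios) + (1 - alpha) * is_ratios
    cs = alpha * jnp.minimum(c_bar, is_ratios) + (1 - alpha) * is_ratios

    deltas = rhos * (rewards + discounts * values[1:] - values[:-1])
    # Backward recursion:
    # v_s - V(x_s) = delta_s + gamma_s * c_s * (v_{s+1} - V(x_{s+1}))
    acc = 0.0
    corrections = []
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + discounts[t] * cs[t] * acc
        corrections.append(acc)
    return values[:-1] + jnp.stack(corrections[::-1])
```

With `alpha = 1` this reduces to standard V-trace, and with `alpha = 0` to unclipped importance sampling, so adapting such coefficients online via the meta-gradient above trades off the contraction and variance properties of the two extremes.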

[1] Richard S. Sutton et al. Adapting Bias by Gradient Descent: An Incremental Version of Delta-Bar-Delta. AAAI, 1992.

[2] Yoshua Bengio et al. Random Search for Hyper-Parameter Optimization. J. Mach. Learn. Res., 2012.

[3] Andrew L. Maas. Rectifier Nonlinearities Improve Neural Network Acoustic Models. 2013.

[4] Tianqi Chen et al. Empirical Evaluation of Rectified Activations in Convolutional Network. arXiv, 2015.

[5] Marc G. Bellemare et al. The Arcade Learning Environment: An Evaluation Platform for General Agents. J. Artif. Intell. Res., 2012.

[6] Shie Mannor et al. Adaptive Lambda Least-Squares Temporal Difference Learning. arXiv:1612.09465, 2016.

[7] Martha White et al. A Greedy Approach to Adapting the Trace Parameter for Temporal Difference Learning. AAMAS, 2016.

[8] Fabian Pedregosa et al. Hyperparameter optimization with approximate gradient. ICML, 2016.

[9] David A. Patterson et al. In-datacenter performance analysis of a tensor processing unit. ISCA, 2017.

[10] Max Jaderberg et al. Population Based Training of Neural Networks. arXiv, 2017.

[11] Sergey Levine et al. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML, 2017.

[12] Tom Schaul et al. Reinforcement Learning with Unsupervised Auxiliary Tasks. ICLR, 2016.

[13] Paolo Frasconi et al. Forward and Reverse Gradient-Based Hyperparameter Optimization. ICML, 2017.

[14] Henry Zhu et al. Soft Actor-Critic Algorithms and Applications. arXiv, 2018.

[15] David Silver et al. Meta-Gradient Reinforcement Learning. NeurIPS, 2018.

[16] Satinder Singh et al. On Learning Intrinsic Rewards for Policy Gradient Methods. NeurIPS, 2018.

[17] Shane Legg et al. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. ICML, 2018.

[18] Yuval Tassa et al. DeepMind Control Suite. arXiv, 2018.

[19] Shimon Whiteson et al. Fast Efficient Hyperparameter Tuning for Policy Gradient Methods. NeurIPS, 2019.

[20] Tom Schaul et al. Adapting Behaviour for Learning Progress. arXiv, 2019.

[21] Richard L. Lewis et al. Discovery of Useful Questions as Auxiliary Tasks. NeurIPS, 2019.

[22] Matthew E. Taylor et al. Metatrace Actor-Critic: Online Step-Size Tuning by Meta-gradient Descent for Reinforcement Learning Control. IJCAI, 2018.

[23] Yoshua Bengio et al. Hyperbolic Discounting and Learning over Multiple Horizons. arXiv, 2019.

[24] R. Munos et al. Adaptive Trade-Offs in Off-Policy Learning. AISTATS, 2019.

[25] D. Mankowitz et al. An empirical investigation of the challenges of real-world reinforcement learning. arXiv, 2020.

[26] Demis Hassabis et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 2019.

[27] Krzysztof Choromanski et al. Online Hyper-parameter Tuning in Off-policy Learning via Evolutionary Strategies. arXiv, 2020.

[28] Off-Policy Actor-Critic with Shared Experience Replay. ICML, 2019.