A Self-Tuning Actor-Critic Algorithm

Reinforcement learning algorithms are highly sensitive to the choice of hyperparameters, typically requiring significant manual effort to identify hyperparameters that perform well on a new domain. In this paper, we take a step towards addressing this issue by using meta-gradient descent (Xu et al., 2018) to adapt hyperparameters automatically online. We apply our algorithm, Self-Tuning Actor-Critic (STAC), to self-tune all the differentiable hyperparameters of an actor-critic loss function, to discover auxiliary tasks, and to improve off-policy learning using a novel leaky V-trace operator. STAC is simple to use, sample-efficient, and requires no significant increase in compute. Ablation studies show that the overall performance of STAC improves as more hyperparameters are self-tuned. When applied to the Arcade Learning Environment (Bellemare et al., 2012), STAC improved the median human-normalized score after $200$M steps from $243\%$ to $364\%$. When applied to the DeepMind Control Suite (Tassa et al., 2018), STAC improved the mean score after $30$M steps from $217$ to $389$ when learning from features, from $108$ to $202$ when learning from pixels, and from $195$ to $295$ on the Real-World Reinforcement Learning Challenge (Dulac-Arnold et al., 2020).
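To make the leaky V-trace operator concrete, here is a minimal NumPy sketch of how its value targets could be computed. This is our illustration, not code from the paper: the function and argument names are hypothetical, and it covers only the value targets. The key idea is to replace V-trace's hard clipping of importance ratios with a convex mixture of clipped and unclipped ratios, controlled by coefficients alpha_rho and alpha_c that STAC treats as meta-parameters and adapts online by meta-gradient descent.

```python
import numpy as np

def leaky_vtrace_targets(rewards, values, bootstrap_value, is_ratios,
                         gamma=0.99, lam=1.0, rho_bar=1.0, c_bar=1.0,
                         alpha_rho=1.0, alpha_c=1.0):
    """Illustrative leaky V-trace value targets for one trajectory.

    is_ratios[t] is the importance ratio pi(a_t|x_t) / mu(a_t|x_t).
    alpha_rho = alpha_c = 1 recovers standard (clipped) V-trace
    (Espeholt et al., 2018); alpha_rho = alpha_c = 0 gives unclipped
    importance sampling. In STAC the alphas are self-tuned online.
    """
    # Leaky clipping: convex mix of clipped and unclipped ratios.
    rhos = alpha_rho * np.minimum(rho_bar, is_ratios) + (1.0 - alpha_rho) * is_ratios
    cs = lam * (alpha_c * np.minimum(c_bar, is_ratios) + (1.0 - alpha_c) * is_ratios)

    # One-step temporal-difference errors, weighted by the leaky rhos.
    values_tp1 = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * values_tp1 - values)

    # Backward recursion:
    #   v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1})).
    vs_minus_v = np.zeros_like(values)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v

# Example call on a 3-step trajectory (all numbers arbitrary).
vs = leaky_vtrace_targets(rewards=np.array([1.0, 0.0, 1.0]),
                          values=np.array([0.5, 0.4, 0.6]),
                          bootstrap_value=0.3,
                          is_ratios=np.array([1.2, 0.8, 1.5]),
                          alpha_rho=0.9, alpha_c=0.9)
```

The motivation for the mixture, by analogy with leaky ReLU activations: a hard clip passes no gradient once a ratio exceeds the threshold, whereas the unclipped term keeps a gradient path open, which is what allows alpha_rho and alpha_c to be tuned by gradient descent in the first place.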

[1] Espeholt, L., et al. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. ICML, 2018.

[2] Xu, Z., et al. Meta-Gradient Reinforcement Learning. NeurIPS, 2018.

[3] Franceschi, L., et al. Forward and Reverse Gradient-Based Hyperparameter Optimization. ICML, 2017.

[4] Rowland, M., et al. Adaptive Trade-Offs in Off-Policy Learning. AISTATS, 2020.

[5] Jouppi, N. P., et al. In-datacenter performance analysis of a tensor processing unit. ISCA, 2017.

[6] Tang, Y. and Choromanski, K. Online Hyper-parameter Tuning in Off-policy Learning via Evolutionary Strategies. arXiv, 2020.

[7] Schrittwieser, J., et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 2020.

[8] Sutton, R. S. Adapting Bias by Gradient Descent: An Incremental Version of Delta-Bar-Delta. AAAI, 1992.

[9] White, M. and White, A. A Greedy Approach to Adapting the Trace Parameter for Temporal Difference Learning. AAMAS, 2016.

[10] Paul, S., et al. Fast Efficient Hyperparameter Tuning for Policy Gradients. NeurIPS, 2019.

[11] Fedus, W., et al. Hyperbolic Discounting and Learning over Multiple Horizons. arXiv, 2019.

[12] Xu, B., et al. Empirical Evaluation of Rectified Activations in Convolutional Network. arXiv, 2015.

[13] Schmitt, S., et al. Off-Policy Actor-Critic with Shared Experience Replay. ICML, 2020.

[14] Mann, T. A., et al. Adaptive Lambda Least-Squares Temporal Difference Learning. arXiv:1612.09465, 2016.

[15] Dulac-Arnold, G., et al. An empirical investigation of the challenges of real-world reinforcement learning. arXiv, 2020.

[16] Young, K., et al. Metatrace Actor-Critic: Online Step-Size Tuning by Meta-gradient Descent for Reinforcement Learning Control. IJCAI, 2018.

[17] Veeriah, V., et al. Discovery of Useful Questions as Auxiliary Tasks. NeurIPS, 2019.

[18] Tassa, Y., et al. DeepMind Control Suite. arXiv, 2018.

[19] Bergstra, J. and Bengio, Y. Random Search for Hyper-Parameter Optimization. JMLR, 2012.

[20] Finn, C., et al. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML, 2017.

[21] Bellemare, M. G., et al. The Arcade Learning Environment: An Evaluation Platform for General Agents. JAIR, 2012.

[22] Haarnoja, T., et al. Soft Actor-Critic Algorithms and Applications. arXiv, 2018.

[23] Schaul, T., et al. Adapting Behaviour for Learning Progress. arXiv, 2019.

[24] Zheng, Z., et al. On Learning Intrinsic Rewards for Policy Gradient Methods. NeurIPS, 2018.

[25] Jaderberg, M., et al. Reinforcement Learning with Unsupervised Auxiliary Tasks. ICLR, 2016.

[26] Pedregosa, F. Hyperparameter optimization with approximate gradient. ICML, 2016.