One Step at a Time: Pros and Cons of Multi-Step Meta-Gradient Reinforcement Learning

Self-tuning algorithms that adapt the learning process online promote more effective and robust learning. Among the available methods, meta-gradients have emerged as a promising approach: they leverage the differentiability of the learning rule with respect to some hyper-parameters to adapt them online. Although meta-gradients can be accumulated over multiple learning steps to avoid myopic updates, this is rarely done in practice. In this work, we demonstrate that whilst multi-step meta-gradients do provide a better learning signal in expectation, this comes at the cost of a significant increase in variance, hindering performance. In light of this analysis, we introduce a novel method that mixes multiple inner steps and thereby enjoys a more accurate and robust meta-gradient signal, essentially trading off bias and variance in the meta-gradient estimate. When applied to the Snake game, the mixing meta-gradient algorithm can cut the variance by a factor of 3 while achieving similar or higher performance.
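
As a rough illustration of the idea, the JAX sketch below computes a k-step meta-gradient by differentiating through k unrolled inner updates, then mixes the 1..K estimates with fixed weights so that longer (less biased but noisier) horizons are averaged with shorter ones. The toy losses, the helper names (`inner_loss`, `outer_loss`, `k_step_meta_grad`, `mixed_meta_grad`), and the mixing coefficients are illustrative assumptions, not the paper's actual objective or algorithm.

```python
# A minimal sketch of mixing multi-step meta-gradients, assuming a toy
# quadratic inner objective; names and constants are hypothetical.
import jax
import jax.numpy as jnp

INNER_LR = 0.1  # hypothetical inner-loop step size


def inner_loss(theta, eta, batch):
    # Toy learning objective whose shape depends on the meta-parameter eta
    # (standing in for, e.g., a discount or bootstrapping coefficient).
    return jnp.mean((batch @ theta - eta) ** 2)


def outer_loss(theta, batch):
    # Toy meta objective evaluated after the inner updates.
    return jnp.mean((batch @ theta - 1.0) ** 2)


def unrolled_outer_loss(eta, theta, batch, k):
    # Differentiate through k inner gradient steps (the "multi-step" part).
    for _ in range(k):
        g = jax.grad(inner_loss)(theta, eta, batch)
        theta = theta - INNER_LR * g
    return outer_loss(theta, batch)


def k_step_meta_grad(eta, theta, batch, k):
    # Meta-gradient of the outer loss w.r.t. eta after k inner steps.
    return jax.grad(unrolled_outer_loss)(eta, theta, batch, k)


def mixed_meta_grad(eta, theta, batch, max_k, weights):
    # Weighted average of the 1..max_k step estimates: longer horizons are
    # less biased but higher variance, so mixing trades off the two.
    grads = jnp.stack(
        [k_step_meta_grad(eta, theta, batch, k) for k in range(1, max_k + 1)]
    )
    return jnp.dot(weights, grads)


# Example usage with hypothetical mixing coefficients.
key = jax.random.PRNGKey(0)
batch = jax.random.normal(key, (32, 4))
theta = jnp.zeros(4)
eta = jnp.array(0.5)
weights = jnp.array([0.5, 0.3, 0.2])
meta_grad = mixed_meta_grad(eta, theta, batch, max_k=3, weights=weights)
```

In this sketch the single k=K estimate corresponds to a pure multi-step meta-gradient, while placing all weight on k=1 recovers the myopic one-step update; intermediate weightings interpolate between the two regimes.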
