One Step at a Time: Pros and Cons of Multi-Step Meta-Gradient Reinforcement Learning

Self-tuning algorithms that adapt the learning process online promote more effective and robust learning. Among the available methods, meta-gradients have emerged as a promising approach: they leverage the differentiability of the learning rule with respect to some hyper-parameters to adapt them online. Although meta-gradients can be accumulated over multiple learning steps to avoid myopic updates, this is rarely done in practice. In this work, we demonstrate that whilst multi-step meta-gradients do provide a better learning signal in expectation, this comes at the cost of a significant increase in variance, hindering performance. In light of this analysis, we introduce a novel method that mixes multiple inner steps and thereby enjoys a more accurate and robust meta-gradient signal, essentially trading off bias and variance in the meta-gradient estimate. When applied to the Snake game, the mixing meta-gradient algorithm can cut the variance by a factor of 3 while achieving similar or higher performance.
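
As a rough illustration of the idea, the JAX sketch below computes a k-step meta-gradient by differentiating through k unrolled inner updates, then mixes the 1..K estimates with fixed weights so that longer (less biased but noisier) horizons are averaged with shorter ones. The toy losses, the helper names (`inner_loss`, `outer_loss`, `k_step_meta_grad`, `mixed_meta_grad`), and the mixing coefficients are illustrative assumptions, not the paper's actual objective or algorithm.

```python
# A minimal sketch of mixing multi-step meta-gradients, assuming a toy
# quadratic inner objective; names and constants are hypothetical.
import jax
import jax.numpy as jnp

INNER_LR = 0.1  # hypothetical inner-loop step size


def inner_loss(theta, eta, batch):
    # Toy learning objective whose shape depends on the meta-parameter eta
    # (standing in for, e.g., a discount or bootstrapping coefficient).
    return jnp.mean((batch @ theta - eta) ** 2)


def outer_loss(theta, batch):
    # Toy meta objective evaluated after the inner updates.
    return jnp.mean((batch @ theta - 1.0) ** 2)


def unrolled_outer_loss(eta, theta, batch, k):
    # Differentiate through k inner gradient steps (the "multi-step" part).
    for _ in range(k):
        g = jax.grad(inner_loss)(theta, eta, batch)
        theta = theta - INNER_LR * g
    return outer_loss(theta, batch)


def k_step_meta_grad(eta, theta, batch, k):
    # Meta-gradient of the outer loss w.r.t. eta after k inner steps.
    return jax.grad(unrolled_outer_loss)(eta, theta, batch, k)


def mixed_meta_grad(eta, theta, batch, max_k, weights):
    # Weighted average of the 1..max_k step estimates: longer horizons are
    # less biased but higher variance, so mixing trades off the two.
    grads = jnp.stack(
        [k_step_meta_grad(eta, theta, batch, k) for k in range(1, max_k + 1)]
    )
    return jnp.dot(weights, grads)


# Example usage with hypothetical mixing coefficients.
key = jax.random.PRNGKey(0)
batch = jax.random.normal(key, (32, 4))
theta = jnp.zeros(4)
eta = jnp.array(0.5)
weights = jnp.array([0.5, 0.3, 0.2])
meta_grad = mixed_meta_grad(eta, theta, batch, max_k=3, weights=weights)
```

In this sketch the single k=K estimate corresponds to a pure multi-step meta-gradient, while placing all weight on k=1 recovers the myopic one-step update; intermediate weightings interpolate between the two regimes.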
