Meta-Reinforcement Learning by Tracking Task Non-stationarity

Many real-world domains are subject to a structured non-stationarity that affects the agent's goals and the environmental dynamics. Meta-reinforcement learning (RL) has proven successful for training agents that quickly adapt to related tasks. However, most existing meta-RL algorithms for non-stationary domains either make strong assumptions on the task-generation process or require sampling from it at training time. In this paper, we propose a novel algorithm (TRIO) that optimizes for the future by explicitly tracking the task evolution through time. At training time, TRIO learns a variational module to quickly identify latent task parameters from experience samples. This module is learned jointly with an optimal exploration policy that takes task uncertainty into account. At test time, TRIO tracks the evolution of the latent parameters online, thereby reducing the uncertainty over future tasks and obtaining fast adaptation through the meta-learned policy. Unlike most existing methods, TRIO does not assume a Markovian task-evolution process, does not require information about the non-stationarity at training time, and captures complex changes occurring in the environment. We evaluate our algorithm on different simulated problems and show that it outperforms competitive baselines.
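To make the "tracking" idea concrete, the following is a minimal, hypothetical sketch (not the paper's implementation): it assumes the variational module has already produced a latent parameter vector for each past task, fits an independent Gaussian-process regressor per latent dimension over the task index, and extrapolates a predictive mean and uncertainty for the upcoming task. A meta-learned policy could then condition on such a prior over the next task's latents. The function name `predict_next_latent` and the choice of kernel are illustrative assumptions.

```python
# Illustrative sketch only: extrapolate the evolution of inferred latent task
# parameters with per-dimension Gaussian-process regression over the task index.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel


def predict_next_latent(latent_history):
    """latent_history: array of shape (T, d) with the latents inferred so far.

    Returns a predictive mean and standard deviation for the latents of task T.
    """
    latent_history = np.asarray(latent_history, dtype=float)
    T, d = latent_history.shape
    t = np.arange(T, dtype=float).reshape(-1, 1)   # past task indices 0..T-1
    t_next = np.array([[float(T)]])                # index of the upcoming task
    means, stds = np.zeros(d), np.zeros(d)
    for i in range(d):
        # Smooth trend plus observation noise; one GP per latent dimension.
        gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
        gp.fit(t, latent_history[:, i])
        mu, sigma = gp.predict(t_next, return_std=True)
        means[i], stds[i] = mu[0], sigma[0]
    return means, stds


# Example: a single latent drifting smoothly over 10 tasks.
history = np.sin(0.3 * np.arange(10)).reshape(-1, 1)
mean, std = predict_next_latent(history)
```

Because the regression is over the task index rather than a hand-specified transition model, this kind of tracker does not require a Markovian assumption on how tasks evolve; the predictive standard deviation provides the task uncertainty that an uncertainty-aware exploration policy could exploit.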
