Task-Agnostic Continual Reinforcement Learning: In Praise of a Simple Baseline

We study task-agnostic continual reinforcement learning (TACRL), in which standard RL challenges are compounded with the partial observability stemming from task agnosticism and with the additional difficulties of continual learning (CL), i.e., learning on a non-stationary sequence of tasks. We compare TACRL methods against their soft upper bounds prescribed by previous literature: multi-task learning (MTL) methods, which do not have to deal with non-stationary data distributions, and task-aware methods, which are allowed to operate under full observability. We consider a previously unexplored yet straightforward baseline for TACRL, replay-based recurrent RL (3RL), in which an RL algorithm is augmented with recurrent mechanisms to address partial observability and with experience replay to mitigate catastrophic forgetting. Studying empirical performance on a sequence of RL tasks, we find surprising instances of 3RL matching and even surpassing its MTL and task-aware soft upper bounds. We lay out hypotheses that could explain this inflection point in continual and task-agnostic learning research. We test these hypotheses empirically on continuous-control tasks through a large-scale study of the popular multi-task and continual learning benchmark Meta-World. By analyzing training statistics, including gradient conflict, we find evidence that 3RL's advantage stems from its ability to quickly infer how new tasks relate to previous ones, enabling forward transfer.
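Below is a minimal sketch of how the 3RL idea described above could be wired up, assuming a SAC-style off-policy actor-critic (only the actor is shown), a GRU context encoder over (observation, previous action, previous reward) to cope with task agnosticism, and a replay buffer of whole trajectories kept across the task sequence to counter forgetting. All class names and hyperparameters here are illustrative, not taken from the paper's implementation.

```python
# Hypothetical 3RL components: a recurrent context encoder, a context-conditioned
# actor, and a trajectory-level replay buffer shared across tasks.
import random
from collections import deque

import torch
import torch.nn as nn


class ContextEncoder(nn.Module):
    """GRU that infers a task/context embedding from the recent history."""

    def __init__(self, obs_dim, act_dim, hidden_dim=128):
        super().__init__()
        self.gru = nn.GRU(obs_dim + act_dim + 1, hidden_dim, batch_first=True)

    def forward(self, obs, prev_act, prev_rew):
        # obs: (B, T, obs_dim), prev_act: (B, T, act_dim), prev_rew: (B, T, 1)
        x = torch.cat([obs, prev_act, prev_rew], dim=-1)
        out, _ = self.gru(x)                      # (B, T, hidden_dim)
        return out


class Actor(nn.Module):
    """Gaussian policy conditioned on the current observation and the context."""

    def __init__(self, obs_dim, act_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + hidden_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * act_dim),          # mean and log-std
        )

    def forward(self, obs, context):
        mean, log_std = self.net(torch.cat([obs, context], dim=-1)).chunk(2, dim=-1)
        return mean, log_std.clamp(-5, 2)


class TrajectoryReplay:
    """Replay buffer of whole trajectories, retained across all tasks in the sequence."""

    def __init__(self, capacity=1000):
        self.trajs = deque(maxlen=capacity)

    def add(self, traj):
        # traj: dict of tensors sharing a common time axis (equal-length episodes assumed)
        self.trajs.append(traj)

    def sample(self, batch_size):
        batch = random.sample(list(self.trajs), min(batch_size, len(self.trajs)))
        return {k: torch.stack([t[k] for t in batch]) for k in batch[0]}
```

A SAC-style critic and entropy-regularized actor update would then be computed on sequences drawn from TrajectoryReplay, with the GRU hidden state recomputed over each sampled sequence so the agent infers the current task purely from its interaction history.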
