Challenging Common Assumptions about Catastrophic Forgetting

Building learning agents that can progressively learn and accumulate knowledge is the core goal of the continual learning (CL) research field. Unfortunately, training a model on new data usually degrades its performance on past data. In the CL literature, this effect is referred to as catastrophic forgetting (CF). CF has been studied extensively, and many methods have been proposed to address it on short sequences of non-overlapping tasks. In such setups, CF always leads to a quick and significant drop in performance on past tasks. Nevertheless, recent work has shown that, despite CF, SGD training of linear models accumulates knowledge in a CL regression setup. This phenomenon becomes especially visible when tasks reoccur. This raises the question of whether DNNs trained with SGD, or any standard gradient-based optimizer, accumulate knowledge in the same way. Such a phenomenon would have interesting consequences for applying DNNs to real continual scenarios, since standard gradient-based optimization is significantly less computationally expensive than existing CL algorithms. In this paper, we study progressive knowledge accumulation (KA) in DNNs trained with gradient-based algorithms on long sequences of tasks with data re-occurrence. We propose a new framework, SCoLe (Scaling Continual Learning), to investigate KA and find that catastrophic forgetting has only a limited effect on DNNs trained with SGD. When DNNs are trained on long sequences in which data re-occurs sparsely, overall accuracy improves, which may be counter-intuitive given the CF phenomenon. We empirically investigate KA in DNNs under various data occurrence frequencies and propose simple and scalable strategies to increase knowledge accumulation in DNNs.
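The kind of protocol described above can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the authors' implementation: it assumes MNIST via torchvision, a small MLP, tasks formed by sampling two classes at random, plain SGD with no CL-specific machinery, and a fixed number of tasks; it tracks test accuracy on all classes across a long task sequence, which is where knowledge accumulation would become visible despite per-task forgetting.

```python
# Minimal sketch of a SCoLe-style long-sequence protocol (assumptions: MNIST,
# a small MLP, 2 randomly sampled classes per task, plain SGD; the paper's
# exact datasets, architectures, and hyperparameters may differ).
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

def make_task(dataset, classes, samples_per_class=256):
    """Build a small training subset containing only the given classes."""
    idx = []
    for c in classes:
        c_idx = (dataset.targets == c).nonzero(as_tuple=True)[0].tolist()
        idx += random.sample(c_idx, min(samples_per_class, len(c_idx)))
    return Subset(dataset, idx)

class MLP(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                                 nn.Linear(256, n_classes))
    def forward(self, x):
        return self.net(x)

def evaluate(model, loader, device):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            pred = model(x.to(device)).argmax(dim=1).cpu()
            correct += (pred == y).sum().item()
            total += y.numel()
    return correct / total

def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tfm = transforms.ToTensor()
    train_set = datasets.MNIST("data", train=True, download=True, transform=tfm)
    test_set = datasets.MNIST("data", train=False, download=True, transform=tfm)
    test_loader = DataLoader(test_set, batch_size=512)

    model = MLP().to(device)
    # Plain SGD: no replay buffer, no regularization, no task-specific heads.
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    n_tasks, classes_per_task = 200, 2
    for t in range(n_tasks):
        # Each task exposes only a random subset of classes, so any given
        # class re-occurs sparsely over the long sequence.
        classes = random.sample(range(10), classes_per_task)
        loader = DataLoader(make_task(train_set, classes),
                            batch_size=64, shuffle=True)
        model.train()
        for x, y in loader:  # single pass over the small task
            opt.zero_grad()
            F.cross_entropy(model(x.to(device)), y.to(device)).backward()
            opt.step()
        if (t + 1) % 20 == 0:
            # Accuracy over all 10 classes: under knowledge accumulation it
            # should trend upward even though each task only trains 2 classes.
            acc = evaluate(model, test_loader, device)
            print(f"task {t + 1}: test accuracy on all classes = {acc:.3f}")

if __name__ == "__main__":
    main()
```

Plotting the printed accuracies against the task index gives the kind of curve the abstract alludes to: local drops after each narrow task, but an upward trend over the long horizon as classes re-occur.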
