Foundational Models for Continual Learning: An Empirical Study of Latent Replay

The rapid development of large-scale pre-training has produced foundation models that serve as effective feature extractors across a wide range of downstream tasks and domains. Motivated by this, we study the efficacy of pre-trained vision models as a foundation for downstream continual learning (CL) scenarios. Our goal is twofold. First, we want to understand the compute-accuracy trade-off between CL in the raw-data space and CL in the latent space of pre-trained encoders. Second, we investigate how the characteristics of the encoder, of the pre-training algorithm and data, and of the resulting latent space affect CL performance. To this end, we compare the efficacy of various pre-trained models in large-scale benchmarking scenarios, using a vanilla replay setting applied in both the latent and the raw-data space. Notably, this study shows how transfer, forgetting, task similarity, and learning depend on the characteristics of the input data and not necessarily on the CL algorithm. First, we show that under some circumstances reasonable CL performance can readily be achieved with a non-parametric classifier at negligible compute cost. We then show that models pre-trained on broader data yield better performance across various replay-buffer sizes, and we explain this through the representational similarity and transfer properties of their representations. Finally, we show the effectiveness of self-supervised (SSL) pre-training for downstream domains that are out-of-distribution with respect to the pre-training domain. We point out and validate several research directions, including representation ensembling, that can further increase the efficacy of latent CL. The diverse set of datasets used in this study can serve as a compute-efficient playground for further CL research. The codebase is available at https://github.com/oleksost/latent_CL.
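Concretely, latent replay amounts to encoding each task's data once with a frozen pre-trained backbone, keeping a small buffer of latent vectors, and training only a lightweight head on the current task's latents mixed with replayed ones. The sketch below illustrates this idea; it is a minimal illustration, not the paper's implementation. It assumes a torchvision ResNet-18 as the backbone (recent torchvision weights API), a linear head, and a uniform-random per-task buffer; the class count, replay budget, and hyperparameters are placeholders.

```python
# Minimal sketch of latent replay with a frozen pre-trained encoder.
# Assumptions (not taken from the paper's codebase): ResNet-18 stands in for
# the foundation model, and the buffer stores randomly selected latent vectors.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import torchvision

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen pre-trained encoder: drop the classification head, keep the features.
backbone = torchvision.models.resnet18(
    weights=torchvision.models.ResNet18_Weights.DEFAULT
)
feat_dim = backbone.fc.in_features
backbone.fc = nn.Identity()
backbone.eval().requires_grad_(False).to(device)

num_classes = 100  # total classes across all tasks (placeholder)
classifier = nn.Linear(feat_dim, num_classes).to(device)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
buffer_z, buffer_y = [], []  # replay buffer of latent vectors and labels


def encode(loader):
    """Push raw images through the frozen encoder once; keep only latents."""
    zs, ys = [], []
    with torch.no_grad():
        for x, y in loader:
            zs.append(backbone(x.to(device)).cpu())
            ys.append(y)
    return torch.cat(zs), torch.cat(ys)


def train_task(task_loader, per_task_budget=200, epochs=5):
    """Train the head on current-task latents mixed with replayed latents."""
    z_task, y_task = encode(task_loader)
    if buffer_z:
        z = torch.cat([z_task] + buffer_z)
        y = torch.cat([y_task] + buffer_y)
    else:
        z, y = z_task, y_task
    latent_loader = DataLoader(TensorDataset(z, y), batch_size=128, shuffle=True)
    classifier.train()
    for _ in range(epochs):
        for zb, yb in latent_loader:
            optimizer.zero_grad()
            loss = nn.functional.cross_entropy(classifier(zb.to(device)), yb.to(device))
            loss.backward()
            optimizer.step()
    # Store a random subset of the *current* task's latents for future replay.
    keep = torch.randperm(len(z_task))[:per_task_budget]
    buffer_z.append(z_task[keep])
    buffer_y.append(y_task[keep])
```

Because only latent vectors are buffered and only the head is optimized, the per-task cost is dominated by a single forward pass over the raw data. The linear head could also be swapped for a non-parametric classifier over the same latents (e.g., nearest class mean), which removes gradient-based training altogether.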
