Posterior Meta-Replay for Continual Learning

Continual Learning (CL) algorithms have recently received a lot of attention as they attempt to overcome the need to train with an i.i.d. sample from some unknown target data distribution. Building on prior work, we study principled ways to tackle the CL problem from a Bayesian perspective and focus on continually learning a task-specific posterior distribution via a shared meta-model, a task-conditioned hypernetwork. This approach, which we term Posterior-replay CL, is in sharp contrast to most Bayesian CL approaches, which focus on the recursive update of a single posterior distribution. The benefits of our approach are (1) increased flexibility to model solutions in weight space and therefore less susceptibility to task dissimilarity, (2) access to principled, task-specific predictive uncertainty estimates that can be used to infer task identity at test time and to detect task boundaries during training, and (3) the ability to revisit and update task-specific posteriors in a principled manner without requiring access to past data. The proposed framework is versatile, which we demonstrate using simple posterior approximations (such as Gaussians) as well as powerful implicit distributions modelled via a neural network. We illustrate the conceptual advance of our framework on low-dimensional problems and show performance gains on computer vision benchmarks.
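To make the core architectural idea concrete, the sketch below illustrates one plausible reading of the setup: a task-conditioned hypernetwork maps a learned task embedding to the parameters (mean and log-variance) of a Gaussian posterior over the weights of a small main network, from which task-specific weights are then sampled via the reparameterization trick. This is a minimal illustration under assumed shapes and names (TaskConditionedHypernet, main_net_forward, etc. are hypothetical), not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskConditionedHypernet(nn.Module):
    """Maps a learned task embedding to Gaussian posterior parameters over main-net weights."""
    def __init__(self, num_tasks, emb_dim, main_param_count, hidden=128):
        super().__init__()
        self.task_emb = nn.Embedding(num_tasks, emb_dim)  # one embedding per task
        self.body = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # heads producing the posterior mean and log-variance for every main-net weight
        self.mu_head = nn.Linear(hidden, main_param_count)
        self.logvar_head = nn.Linear(hidden, main_param_count)

    def forward(self, task_id):
        h = self.body(self.task_emb(torch.tensor([task_id])))
        return self.mu_head(h).squeeze(0), self.logvar_head(h).squeeze(0)

def sample_main_weights(mu, logvar):
    # reparameterization trick: w = mu + sigma * eps
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def main_net_forward(x, w, in_dim=2, hidden=16, out_dim=2):
    # a tiny two-layer main network whose weights are generated, not stored
    i = 0
    W1 = w[i:i + in_dim * hidden].view(hidden, in_dim); i += in_dim * hidden
    b1 = w[i:i + hidden]; i += hidden
    W2 = w[i:i + hidden * out_dim].view(out_dim, hidden); i += hidden * out_dim
    b2 = w[i:i + out_dim]
    return F.linear(F.relu(F.linear(x, W1, b1)), W2, b2)

# Usage: sample task-specific weights for task 0 and compute predictions.
param_count = 2 * 16 + 16 + 16 * 2 + 2
hnet = TaskConditionedHypernet(num_tasks=3, emb_dim=8, main_param_count=param_count)
mu, logvar = hnet(task_id=0)
w = sample_main_weights(mu, logvar)
logits = main_net_forward(torch.randn(4, 2), w)
```

Because only the hypernetwork and the per-task embeddings are stored, past posteriors can in principle be regenerated from their embeddings without replaying past data, which is the property the abstract refers to as replaying posteriors via a shared meta-model.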
