Maintaining Plasticity in Deep Continual Learning

Modern deep-learning systems are specialized for settings in which training occurs once and then never again, as opposed to continual-learning settings in which training continues indefinitely. It is well known that deep-learning systems applied in a continual-learning setting may catastrophically forget earlier examples. More fundamental, but less well known, is that they may also lose their ability to adapt to new data, a phenomenon called \textit{loss of plasticity}. We demonstrate loss of plasticity using the MNIST and ImageNet datasets, repurposed for continual learning as long sequences of tasks. On ImageNet, binary classification performance dropped from 89% correct on an early task down to 77%, or roughly the level of a linear network, on the 2000th task. Such loss of plasticity occurred across a wide range of deep network architectures, optimizers, and activation functions, and was not alleviated by batch normalization or dropout. In our experiments, loss of plasticity was correlated with a proliferation of dead units, with units whose weights grew very large, and more generally with a loss of unit diversity. Loss of plasticity was substantially reduced by $L^2$-regularization, particularly when combined with weight perturbation (Shrink and Perturb). We show that plasticity can be fully maintained by a new algorithm, \textit{continual backpropagation}, which is just like conventional backpropagation except that a small fraction of less-used units are reinitialized after each example.
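
The abstract describes continual backpropagation only at a high level: ordinary backpropagation plus selective reinitialization of a small fraction of less-used units. The PyTorch sketch below shows one way such selective reinitialization could be wired into a single fully connected layer. The particular utility measure (a running average of activation magnitude times outgoing-weight magnitude), the maturity threshold, and all hyperparameter values are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
# Minimal sketch of selective reinitialization of low-utility hidden units,
# in the spirit of continual backpropagation. Utility measure, maturity
# threshold, and hyperparameters are assumptions made for illustration.
import torch
import torch.nn as nn


class ReinitLinearBlock(nn.Module):
    def __init__(self, n_in, n_hidden, n_out,
                 replacement_rate=1e-4, decay=0.99, maturity=100):
        super().__init__()
        self.fc_in = nn.Linear(n_in, n_hidden)
        self.fc_out = nn.Linear(n_hidden, n_out)
        self.replacement_rate = replacement_rate  # fraction of eligible units reset per step
        self.decay = decay                        # decay for the running utility estimate
        self.maturity = maturity                  # steps a unit must live before it can be reset
        self.register_buffer("utility", torch.zeros(n_hidden))
        self.register_buffer("age", torch.zeros(n_hidden))
        self.to_replace = 0.0                     # accumulates fractional replacements

    def forward(self, x):
        h = torch.relu(self.fc_in(x))
        with torch.no_grad():
            # Crude per-unit utility: how active a unit is, times how much
            # the next layer relies on it.
            contrib = h.abs().mean(dim=0) * self.fc_out.weight.abs().mean(dim=0)
            self.utility.mul_(self.decay).add_((1.0 - self.decay) * contrib)
            self.age += 1
        return self.fc_out(h)

    @torch.no_grad()
    def reinit_low_utility_units(self):
        # Call after each training example/step: reset a small fraction of
        # the least-used, sufficiently old units.
        eligible = (self.age > self.maturity).nonzero(as_tuple=True)[0]
        self.to_replace += self.replacement_rate * len(eligible)
        n = int(self.to_replace)
        if n == 0 or len(eligible) == 0:
            return
        self.to_replace -= n
        idx = eligible[self.utility[eligible].argsort()[:n]]
        fresh = torch.empty_like(self.fc_in.weight)
        nn.init.kaiming_uniform_(fresh, a=5 ** 0.5)   # same scheme as nn.Linear's default init
        self.fc_in.weight[idx] = fresh[idx]           # new random incoming weights
        self.fc_in.bias[idx] = 0.0
        self.fc_out.weight[:, idx] = 0.0              # zero outgoing weights: reset units start silent
        self.utility[idx] = 0.0
        self.age[idx] = 0.0
```

In use, one would call `reinit_low_utility_units()` after each optimizer step. Note that resetting a unit's weights in this sketch does not clear any optimizer state (e.g., Adam moments) associated with those weights; a complete implementation would also need to handle that.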
