Wide Neural Networks Forget Less Catastrophically

A growing body of research in continual learning is devoted to overcoming the “catastrophic forgetting” of neural networks by designing new algorithms that are more robust to distribution shifts. While recent progress in the continual learning literature is encouraging, our understanding of which properties of neural networks contribute to catastrophic forgetting is still limited. To address this, instead of focusing on continual learning algorithms, in this work we focus on the model itself and study the impact of the “width” of the neural network architecture on catastrophic forgetting, showing that width has a surprisingly significant effect on forgetting. To explain this effect, we study the learning dynamics of the network from various perspectives, such as gradient norm and sparsity, orthogonalization, and the lazy training regime. We provide potential explanations that are consistent with the empirical results across different architectures and continual learning benchmarks.
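As a rough illustration of the kind of quantities involved, the sketch below compares gradient norm, gradient sparsity, and an orthogonality proxy (cosine similarity between the gradients of two tasks) for MLPs of increasing width. This is a minimal, hypothetical example, not the paper's experimental setup: it uses synthetic random data in place of continual learning benchmarks, an arbitrary near-zero threshold for "sparsity", and a plain ReLU MLP.

```python
# Hypothetical sketch (not the paper's code): for MLPs of increasing width,
# compare gradient norm, gradient sparsity, and the cosine similarity between
# the gradients of two synthetic "tasks" -- rough proxies for the quantities
# the paper studies (gradient norm/sparsity and orthogonalization).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)


def flat_grad(model, x, y):
    """Return the loss gradient w.r.t. all parameters as one flat vector."""
    model.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return torch.cat([p.grad.reshape(-1) for p in model.parameters()])


def make_task(n=256, d=64, classes=10):
    """Synthetic stand-in for a task: random inputs and random labels."""
    return torch.randn(n, d), torch.randint(0, classes, (n,))


task_a, task_b = make_task(), make_task()

for width in [16, 64, 256, 1024]:
    model = nn.Sequential(
        nn.Linear(64, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        nn.Linear(width, 10),
    )
    g_a = flat_grad(model, *task_a)
    g_b = flat_grad(model, *task_b)

    norm = g_a.norm().item()
    # "Sparsity": fraction of gradient entries close to zero (arbitrary cutoff).
    sparsity = (g_a.abs() < 1e-3 * g_a.abs().max()).float().mean().item()
    # Orthogonality proxy: cosine similarity between the two task gradients.
    cos = F.cosine_similarity(g_a, g_b, dim=0).item()

    print(f"width={width:5d}  |grad|={norm:.3f}  "
          f"sparsity={sparsity:.3f}  cos(g_a, g_b)={cos:+.3f}")
```

Under these assumptions, one would inspect how the reported statistics trend as `width` grows; the actual paper measures such quantities on real continual learning benchmarks and architectures.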
