Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics

Catastrophic forgetting is a recurring challenge to developing versatile deep learning models. Despite its ubiquity, there is limited understanding of its connections to neural network (hidden) representations and task semantics. In this paper, we address this important knowledge gap. Through quantitative analysis of neural representations, we find that deeper layers are disproportionately responsible for forgetting, with sequential training resulting in an erasure of earlier task representational subspaces. Methods to mitigate forgetting stabilize these deeper layers, but show diversity on precise effects, with some increasing feature reuse while others store task representations orthogonally, preventing interference. These insights also enable the development of an analytic argument and empirical picture relating forgetting to task semantic similarity, where we find that maximal forgetting occurs for task sequences with intermediate similarity.

[1]  Ankur Bapna,et al.  Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges , 2019, ArXiv.

[2]  Yoshua Bengio,et al.  An Empirical Investigation of Catastrophic Forgeting in Gradient-Based Neural Networks , 2013, ICLR.

[3]  Richard Socher,et al.  A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation , 2018, ICLR.

[4]  Geoffrey E. Hinton,et al.  Similarity of Neural Network Representations Revisited , 2019, ICML.

[5]  Ankur Bapna,et al.  Investigating Multilingual NMT Representations at Scale , 2019, EMNLP.

[6]  Richard Socher,et al.  Learn to Grow: A Continual Structure Learning Framework for Overcoming Catastrophic Forgetting , 2019, ICML.

[7]  Elahe Rahimtoroghi,et al.  What Happens To BERT Embeddings During Fine-tuning? , 2020, BLACKBOXNLP.

[8]  Samy Bengio,et al.  Insights on representational similarity in neural networks with canonical correlation , 2018, NeurIPS.

[9]  Stefano Soatto,et al.  Toward Understanding Catastrophic Forgetting in Continual Learning , 2019, ArXiv.

[10]  Stephan Günnemann,et al.  Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift , 2018, NeurIPS.

[11]  Jascha Sohl-Dickstein,et al.  SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability , 2017, NIPS.

[12]  R Ratcliff,et al.  Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. , 1990, Psychological review.

[13]  David Barber,et al.  Online Structured Laplace Approximations For Overcoming Catastrophic Forgetting , 2018, NeurIPS.

[14]  Surya Ganguli,et al.  A mathematical theory of semantic development in deep neural networks , 2018, Proceedings of the National Academy of Sciences.

[15]  Alexandros Karatzoglou,et al.  Overcoming Catastrophic Forgetting with Hard Attention to the Task , 2018 .

[16]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[17]  Sebastian Nowozin,et al.  Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift , 2019, NeurIPS.

[18]  Yoshua Bengio,et al.  An Empirical Study of Example Forgetting during Deep Neural Network Learning , 2018, ICLR.

[19]  Surya Ganguli,et al.  Continual Learning Through Synaptic Intelligence , 2017, ICML.

[20]  Guy Gur-Ari,et al.  Wider Networks Learn Better Features , 2019, ArXiv.

[21]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Behnam Neyshabur,et al.  What is being transferred in transfer learning? , 2020, NeurIPS.

[23]  Tom Schaul,et al.  Prioritized Experience Replay , 2015, ICLR.

[24]  Ethan Dyer,et al.  Affinity and Diversity: Quantifying Mechanisms of Data Augmentation , 2020, ArXiv.

[25]  Bernhard Schölkopf,et al.  Measuring Statistical Dependence with Hilbert-Schmidt Norms , 2005, ALT.

[26]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Ronald Kemker,et al.  Measuring Catastrophic Forgetting in Neural Networks , 2017, AAAI.

[28]  OctoMiao Overcoming catastrophic forgetting in neural networks , 2016 .

[29]  Benjamin Recht,et al.  Do ImageNet Classifiers Generalize to ImageNet? , 2019, ICML.

[30]  Anthony V. Robins,et al.  Catastrophic Forgetting, Rehearsal and Pseudorehearsal , 1995, Connect. Sci..

[31]  David Rolnick,et al.  Experience Replay for Continual Learning , 2018, NeurIPS.

[32]  Pascal Vincent,et al.  Visualizing Higher-Layer Features of a Deep Network , 2009 .

[33]  Maithra Raghu,et al.  A Survey of Deep Learning for Scientific Discovery , 2020, ArXiv.