Does the Adam Optimizer Exacerbate Catastrophic Forgetting?

Catastrophic forgetting remains a severe obstacle to the broad application of artificial neural networks (ANNs), yet it is still a poorly understood phenomenon. Despite the extensive body of work on catastrophic forgetting, we argue that it remains unclear how exactly the phenomenon should be quantified and, moreover, to what degree the many choices made when designing a learning system affect the amount of forgetting observed. Using testbeds from both the reinforcement learning and supervised learning literature, we (1) provide evidence that the choice of gradient-based optimization algorithm used to train an ANN has a significant impact on the amount of catastrophic forgetting, and show that, surprisingly, classical algorithms such as vanilla SGD often experience less catastrophic forgetting than more modern algorithms such as Adam; and (2) empirically compare four existing metrics for quantifying catastrophic forgetting and show that the measured degree of forgetting is sensitive enough to the choice of metric that switching from one principled metric to another can dramatically change the conclusions of a study. Our results suggest that a much more rigorous experimental methodology is required when studying catastrophic forgetting. Based on these results, we recommend that inter-task forgetting in supervised learning be measured with both retention and relearning metrics concurrently, and that intra-task forgetting in reinforcement learning be measured, at the very least, with pairwise interference.
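
To make the first point concrete, below is a minimal, self-contained sketch of the kind of comparison involved: a small network is trained on two tasks in sequence under either SGD or Adam, and forgetting on the first task is summarized with a retention-style score. The synthetic tasks, network size, learning rates, and training lengths here are illustrative assumptions, not the testbeds or hyperparameters used in the paper.

```python
# Sketch: sequential training on two tasks, comparing retention under SGD vs. Adam.
# Everything below (task construction, model, hyperparameters) is illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(n=2000, d=20):
    # Each task is a random linearly separable binary classification problem.
    X = torch.randn(n, d)
    w = torch.randn(d)
    y = (X @ w > 0).long()
    return X, y

def accuracy(model, X, y):
    with torch.no_grad():
        return (model(X).argmax(dim=1) == y).float().mean().item()

def train(model, optimizer, X, y, epochs=50):
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss_fn(model(X), y).backward()
        optimizer.step()

task_a, task_b = make_task(), make_task()

for opt_name in ["sgd", "adam"]:
    torch.manual_seed(1)  # identical initialization for both optimizers
    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    if opt_name == "sgd":
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    else:
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    train(model, optimizer, *task_a)
    acc_a_before = accuracy(model, *task_a)  # task A accuracy right after learning A
    train(model, optimizer, *task_b)
    acc_a_after = accuracy(model, *task_a)   # task A accuracy after learning B

    # Retention-style forgetting score: how much task A accuracy was lost.
    print(f"{opt_name}: retention drop on task A = {acc_a_before - acc_a_after:.3f}")
```

A relearning metric would instead retrain on task A after task B and count how many updates are needed to recover the original accuracy; the paper's second point is that these two views of forgetting can disagree.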
