Does the Adam Optimizer Exacerbate Catastrophic Forgetting?

Catastrophic forgetting remains a severe hindrance to the broad application of artificial neural networks (ANNs), yet it continues to be a poorly understood phenomenon. Despite the extensive body of work on catastrophic forgetting, we argue that it is still unclear exactly how the phenomenon should be quantified and, moreover, to what degree the choices we make when designing learning systems affect the amount of catastrophic forgetting. We use several testbeds from the reinforcement learning and supervised learning literature to (1) provide evidence that the choice of modern gradient-based optimization algorithm used to train an ANN has a significant impact on the amount of catastrophic forgetting, and show that, surprisingly, in many instances classical algorithms such as vanilla SGD experience less catastrophic forgetting than more modern algorithms such as Adam. We also empirically compare four existing metrics for quantifying catastrophic forgetting and (2) show that the measured degree of catastrophic forgetting is sufficiently sensitive to the metric used that switching from one principled metric to another can dramatically change the conclusions of a study. Our results suggest that a much more rigorous experimental methodology is required when studying catastrophic forgetting. Based on our results, we recommend that inter-task forgetting in supervised learning be measured with both retention and relearning metrics concurrently, and that intra-task forgetting in reinforcement learning be measured, at the very least, with pairwise interference.
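For context, the contrast between the two optimizers discussed above comes down to the statistics each update maintains. The update rules below follow the standard formulations of stochastic gradient descent and Adam (with g_t denoting the gradient at step t, \alpha the step size, and \beta_1, \beta_2, \epsilon the usual Adam hyperparameters); they are included here only as a reminder of the standard algorithms, not as a description of this paper's experimental configuration.

Vanilla SGD:
    \theta_{t+1} = \theta_t - \alpha\, g_t

Adam:
    m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t
    v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
    \hat{m}_t = m_t / (1 - \beta_1^t), \qquad \hat{v}_t = v_t / (1 - \beta_2^t)
    \theta_{t+1} = \theta_t - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)

Because Adam's moment estimates m_t and v_t accumulate gradient statistics over time, they can carry information from earlier data into later updates, which is one plausible reason the choice of optimizer could interact with forgetting.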
