Understanding the Role of Training Regimes in Continual Learning

Catastrophic forgetting affects the training of neural networks, limiting their ability to learn multiple tasks sequentially. From the perspective of the well-established plasticity-stability dilemma, neural networks tend to be overly plastic, lacking the stability necessary to prevent the forgetting of previous knowledge, which means that as learning progresses, networks tend to forget previously seen tasks. This phenomenon, coined catastrophic forgetting in the continual learning literature, has attracted much attention lately, and several families of approaches have been proposed with varying degrees of success. However, there has been limited prior work extensively analyzing the impact that different training regimes (learning rate, batch size, regularization method) can have on forgetting. In this work, we depart from the typical approach of altering the learning algorithm to improve stability. Instead, we hypothesize that the geometrical properties of the local minima found for each task play an important role in the overall degree of forgetting. In particular, we study the effect of dropout, learning rate decay, and batch size on forming training regimes that widen the tasks' local minima and, consequently, help the network avoid catastrophic forgetting. Our study provides practical insights to improve stability via simple yet effective techniques that outperform alternative baselines.
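
To make the kind of training regime under study concrete, the following is a minimal PyTorch-style sketch of sequential-task training with the three ingredients named above: dropout in the network, a learning rate that is decayed at each task boundary, and small batches fed through plain SGD. It is an illustrative assumption of how such a regime could be set up, not the paper's actual implementation; the model architecture, the helper names (make_model, train_sequentially, task_loaders), and all hyperparameter values are placeholders.

```python
# Sketch of a "stable" training regime for a sequence of tasks (assumed setup,
# not the authors' released code). Ingredients: dropout, per-task learning rate
# decay, and small-batch SGD, which the paper links to wider task minima.
import torch
import torch.nn as nn
import torch.optim as optim


def make_model(num_classes: int = 10) -> nn.Module:
    # Small MLP with dropout layers acting as the implicit regularizer.
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(784, 256), nn.ReLU(), nn.Dropout(p=0.25),
        nn.Linear(256, 256), nn.ReLU(), nn.Dropout(p=0.25),
        nn.Linear(256, num_classes),
    )


def train_sequentially(task_loaders, init_lr=0.1, lr_decay=0.4, epochs_per_task=5):
    # task_loaders: list of DataLoaders, one per task, built with a small batch
    # size (e.g. 16) so that gradient noise keeps the reached minima wide.
    model = make_model()
    criterion = nn.CrossEntropyLoss()
    lr = init_lr
    for task_id, loader in enumerate(task_loaders):
        # Plain SGD, re-instantiated per task with the decayed learning rate.
        optimizer = optim.SGD(model.parameters(), lr=lr)
        model.train()
        for _ in range(epochs_per_task):
            for x, y in loader:
                optimizer.zero_grad()
                loss = criterion(model(x), y)
                loss.backward()
                optimizer.step()
        lr *= lr_decay  # shrink the step size before moving to the next task
    return model
```

The per-task decay is the illustrative design choice here: later tasks are trained with smaller steps so their solutions stay close to, and inside the wide basins of, earlier tasks' minima.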
