Optimization and Generalization of Regularization-Based Continual Learning: A Loss Approximation Viewpoint

Neural networks have achieved remarkable success in many cognitive tasks. However, when they are trained sequentially on multiple tasks without access to old data, their performance on earlier tasks tends to drop significantly. This problem, often referred to as catastrophic forgetting, is a key challenge in continual learning of neural networks. The regularization-based approach is one of the primary classes of methods for alleviating catastrophic forgetting. In this paper, we provide a novel viewpoint on regularization-based continual learning by formulating it as a second-order Taylor approximation of the loss function of each task. This viewpoint leads to a unified framework that can be instantiated to derive many existing algorithms, such as Elastic Weight Consolidation and the Kronecker-factored Laplace approximation. Based on this viewpoint, we study both the optimization aspects (i.e., convergence) and the generalization properties (i.e., finite-sample guarantees) of regularization-based continual learning. Our theoretical results highlight the importance of accurately approximating the Hessian matrix. Experimental results on several benchmarks provide empirical validation of our theoretical findings.
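
To make the viewpoint concrete, below is a minimal PyTorch sketch of the generic quadratic surrogate: each old task's loss is replaced by its second-order Taylor expansion around that task's solution, and the choice of Hessian approximation determines the algorithm (a diagonal empirical Fisher recovers an EWC-style penalty; a Kronecker-factored curvature estimate would yield the Laplace variant). The names diagonal_fisher, quadratic_penalty, and the regularization weight lam are illustrative assumptions, not the paper's notation.

    import torch

    def diagonal_fisher(model, loader, loss_fn):
        """Estimate a diagonal approximation of the old task's Hessian at its
        solution via the empirical Fisher (averaged squared gradients)."""
        fisher = [torch.zeros_like(p) for p in model.parameters()]
        n_batches = 0
        for x, y in loader:
            model.zero_grad()
            loss_fn(model(x), y).backward()
            for f, p in zip(fisher, model.parameters()):
                f += p.grad.detach() ** 2
            n_batches += 1
        return [f / max(n_batches, 1) for f in fisher]

    def quadratic_penalty(params, anchor_params, hessian_diag):
        """Second-order Taylor surrogate of an old task's loss around its
        optimum theta_k*:
            L_k(theta) ~= L_k(theta_k*)
                          + 0.5 * (theta - theta_k*)^T H_k (theta - theta_k*).
        With a diagonal H_k this reduces to the EWC-style regularizer."""
        penalty = 0.0
        for p, p_star, h in zip(params, anchor_params, hessian_diag):
            penalty = penalty + 0.5 * (h * (p - p_star) ** 2).sum()
        return penalty

    def new_task_loss(model, batch, loss_fn, anchor_params, hessian_diag, lam=1.0):
        """Training objective on the new task: its own loss plus the quadratic
        surrogate that stands in for the inaccessible old-task losses."""
        x, y = batch
        base = loss_fn(model(x), y)
        reg = quadratic_penalty(list(model.parameters()), anchor_params, hessian_diag)
        return base + lam * reg

In this sketch, anchor_params are copies of the parameters at the end of the previous task, and the accuracy of hessian_diag (or a richer curvature estimate in its place) is exactly the quantity whose importance the theoretical results emphasize.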
