A study on the plasticity of neural networks

One aim shared by several settings, such as continual learning and transfer learning, is to leverage previously acquired knowledge to converge faster on the current task. This is usually done through fine-tuning, which implicitly assumes that the network maintains its plasticity: the performance it can reach on any given task is not negatively affected by the tasks it has seen before. It has recently been observed that a model pretrained on data from the same distribution as the data it is fine-tuned on may not reach the same generalisation as a freshly initialised one. We build on and extend this observation, providing a hypothesis for the mechanism behind it. We discuss the implications of losing plasticity for continual learning, which relies heavily on optimising pretrained models.
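The sketch below illustrates the kind of comparison described above: a small network is warm-started on an earlier slice of data and then fine-tuned on the current task, against a freshly initialised baseline trained on the current task alone. This is a minimal sketch assuming a PyTorch-style setup; the model, synthetic data, and hyperparameters are hypothetical placeholders, not the paper's experimental protocol.

```python
# Minimal sketch of the warm-start vs. fresh-initialisation comparison.
# Model, data, and hyperparameters are hypothetical placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

IN_DIM, N_CLASSES = 20, 5
teacher = torch.randn(IN_DIM, N_CLASSES, generator=torch.Generator().manual_seed(42))

def make_data(n, seed):
    # Synthetic classification data; all splits share the same teacher,
    # i.e. they come from the same underlying distribution.
    g = torch.Generator().manual_seed(seed)
    x = torch.randn(n, IN_DIM, generator=g)
    y = (x @ teacher).argmax(dim=1)
    return TensorDataset(x, y)

def make_model():
    return nn.Sequential(nn.Linear(IN_DIM, 64), nn.ReLU(), nn.Linear(64, N_CLASSES))

def train(model, dataset, epochs=20, lr=1e-2):
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

@torch.no_grad()
def accuracy(model, dataset):
    x, y = dataset.tensors
    return (model(x).argmax(dim=1) == y).float().mean().item()

pretrain_data = make_data(500, seed=0)   # previously seen data
current_data = make_data(2000, seed=1)   # current task, same distribution
test_data = make_data(500, seed=2)

# Warm start: pretrain first, then fine-tune on the current task.
warm = train(train(make_model(), pretrain_data), current_data)
# Baseline: a freshly initialised network trained only on the current task.
fresh = train(make_model(), current_data)

print(f"warm-started test accuracy:        {accuracy(warm, test_data):.3f}")
print(f"freshly initialised test accuracy: {accuracy(fresh, test_data):.3f}")
```

Under the observation discussed above, the warm-started network can end up generalising worse than the freshly initialised one, even though both see the same current-task data.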
