Learn Faster and Forget Slower via Fast and Stable Task Adaptation

Training Deep Neural Networks (DNNs) is still highly time-consuming and compute-intensive. It has been shown that adapting a pretrained model can significantly accelerate this process. With a focus on classification, we show that current fine-tuning techniques make pretrained models catastrophically forget the transferred knowledge even before anything about the new task is learned. Such rapid knowledge loss undermines the merits of transfer learning and may result in much slower convergence than when the transferred knowledge is fully exploited. We investigate the source of this problem from several perspectives and, to alleviate it, introduce Fast And Stable Task-adaptation (FAST), an easy-to-apply fine-tuning algorithm. The paper provides a novel geometric perspective on how the loss landscapes of the source and target tasks are linked under different transfer learning strategies. We empirically show that, compared to prevailing fine-tuning practices, FAST learns the target task faster and forgets the source task slower. The code is available at this https URL.
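This excerpt does not spell out the FAST algorithm itself, so the sketch below is only an illustration of the fine-tuning setting the abstract critiques, not the authors' method. It is a minimal PyTorch example, assuming a torchvision ResNet-50 backbone and a hypothetical 257-class target task, that contrasts the prevailing practice (one learning rate for all parameters, where large early gradients from the randomly initialized head can quickly disturb the pretrained features) with a more conservative variant that updates the backbone much more slowly than the new head. All names and hyperparameters are placeholders.

    # Illustrative sketch only; NOT the paper's FAST algorithm.
    import torch
    import torch.nn as nn
    from torchvision import models

    NUM_TARGET_CLASSES = 257  # hypothetical target task, e.g. Caltech-256

    # Load an ImageNet-pretrained backbone and replace its classifier head.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, NUM_TARGET_CLASSES)

    # (a) Prevailing practice: a single learning rate for every parameter.
    # Early gradients from the randomly initialized head can be large and
    # may rapidly overwrite the transferred representation.
    naive_optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

    # (b) A more conservative alternative: a much smaller learning rate for the
    # pretrained backbone than for the new head, so the transferred features
    # change slowly while the head adapts to the target task.
    backbone_params = [p for n, p in model.named_parameters() if not n.startswith("fc.")]
    head_params = list(model.fc.parameters())
    stable_optimizer = torch.optim.SGD(
        [
            {"params": backbone_params, "lr": 1e-4},
            {"params": head_params, "lr": 1e-2},
        ],
        momentum=0.9,
    )

Either optimizer can then be used in a standard training loop; the point of the contrast is only that how aggressively the pretrained weights are updated determines how quickly the transferred knowledge is lost.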
