Improved Regularization and Robustness for Fine-tuning in Neural Networks

A widely used algorithm for transfer learning is fine-tuning, where a pre-trained model is fine-tuned on a target task with a small amount of labeled data. When the capacity of the pre-trained model is much larger than the size of the target data set, fine-tuning is prone to overfitting and "memorizing" the training labels. Hence, an important question is how to regularize fine-tuning and ensure its robustness to noise. To address this question, we begin by analyzing the generalization properties of fine-tuning. We present a PAC-Bayes generalization bound that depends on the distance traveled in each layer during fine-tuning and on the noise stability of the fine-tuned model, and we measure these quantities empirically. Based on the analysis, we propose regularized self-labeling, an interpolation between regularization and self-labeling methods that combines (i) layer-wise regularization to constrain the distance traveled in each layer and (ii) self label-correction and label-reweighting to correct mislabeled data points on which the model is confident and to down-weight data points on which it is less confident. We validate our approach on an extensive collection of image and text data sets using multiple pre-trained model architectures. Our approach improves baseline methods by 1.76% (on average) on seven image classification tasks and by 0.75% on a few-shot classification task. When the target data set includes noisy labels, our approach outperforms baseline methods by 3.56% on average across two noisy settings.
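
The sketch below is a rough illustration of the two components described in the abstract: a layer-wise distance penalty that constrains how far each layer moves from its pre-trained initialization, and confidence-based label correction and reweighting. It is a minimal PyTorch sketch, not the paper's exact formulation; the per-layer coefficient dictionary `coeffs`, the confidence thresholds, and the linear down-weighting rule are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def layerwise_distance_penalty(model, pretrained_state, coeffs):
    """Sum over layers of coeff_l * ||w_l - w_l^(0)||_2, where w_l^(0) is the
    pre-trained value of layer l (coefficients are assumed, e.g. tuned per layer)."""
    penalty = 0.0
    for name, param in model.named_parameters():
        ref = pretrained_state[name].to(param.device)
        penalty = penalty + coeffs.get(name, 1.0) * torch.norm(param - ref)
    return penalty

def regularized_self_labeling_loss(model, pretrained_state, x, y, coeffs,
                                   correct_threshold=0.9, reweight_threshold=0.5):
    """Cross-entropy with confidence-based label correction and reweighting,
    plus the layer-wise distance penalty (thresholds are illustrative)."""
    logits = model(x)
    probs = F.softmax(logits, dim=1)
    confidence, prediction = probs.max(dim=1)

    # Replace labels the model confidently disagrees with; keep the rest.
    corrected_y = torch.where(confidence > correct_threshold, prediction, y)
    # Down-weight examples the model is unsure about.
    weights = torch.where(confidence < reweight_threshold,
                          confidence / reweight_threshold,
                          torch.ones_like(confidence))

    ce = F.cross_entropy(logits, corrected_y, reduction="none")
    return (weights * ce).mean() + layerwise_distance_penalty(model, pretrained_state, coeffs)
```

In use, a caller would snapshot the pre-trained weights once before fine-tuning, e.g. `pretrained_state = {k: v.detach().clone() for k, v in model.state_dict().items()}`, and pass them to the loss at every training step so the penalty is measured against the original initialization rather than the drifting weights.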
