Robust Fine-Tuning of Deep Neural Networks with Hessian-based Generalization Guarantees

We consider transfer learning approaches that fine-tune a pretrained deep neural network on a target task. We investigate the generalization properties of fine-tuning to understand the problem of overfitting, which often arises in practice. Previous works have shown that constraining the distance from the initialization during fine-tuning improves generalization. Using a PAC-Bayesian analysis, we observe that, besides the distance from initialization, the Hessian affects generalization through the noise stability of deep neural networks under noise injections. Motivated by this observation, we develop Hessian distance-based generalization bounds for a wide range of fine-tuning methods. Next, we investigate the robustness of fine-tuning with noisy labels. We design an algorithm that incorporates consistent losses and distance-based regularization for fine-tuning, and we prove a generalization error bound for this algorithm under class-conditional independent noise in the training labels. We perform a detailed empirical study of our algorithm in various noisy settings and across architectures. For example, on six image classification tasks whose training labels are generated by programmatic labeling, we show a 3.26% accuracy improvement over prior methods. Meanwhile, the Hessian distance measure of the fine-tuned network decreases six times more with our algorithm than with existing approaches.
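To make the described training objective concrete, below is a minimal PyTorch sketch of one fine-tuning step that combines a noise-robust (consistent) loss with distance-based regularization toward the pretrained initialization. The specific choice of symmetric cross-entropy as the consistent-loss component, the hyperparameter names (lam, beta), and the clamping constant are illustrative assumptions for this sketch, not the paper's exact algorithm.

```python
# Hypothetical sketch: distance-regularized fine-tuning with a symmetric
# (noise-robust) loss. The penalty weight `lam`, the weight `beta`, and the
# clamp floor are illustrative choices, not the paper's exact recipe.
import torch
import torch.nn.functional as F


def fine_tune_step(model, init_model, x, y, optimizer,
                   lam=0.01, beta=1.0, num_classes=10):
    """One optimization step: consistent loss + distance-from-init penalty."""
    optimizer.zero_grad()
    log_p = F.log_softmax(model(x), dim=1)

    # Forward cross-entropy on (possibly noisy) labels.
    ce = F.nll_loss(log_p, y)

    # Reverse cross-entropy with clamped targets, a common symmetric-loss
    # construction that is robust to label noise (assumed here for illustration).
    one_hot = F.one_hot(y, num_classes).float()
    rce = -(log_p.exp() * torch.log(one_hot.clamp(min=1e-4))).sum(dim=1).mean()

    # Distance-based regularization: squared L2 distance of the current
    # weights from the frozen pretrained initialization.
    dist = sum(((p - q.detach()) ** 2).sum()
               for p, q in zip(model.parameters(), init_model.parameters()))

    loss = ce + beta * rce + lam * dist
    loss.backward()
    optimizer.step()
    return loss.item()


# Usage sketch (hypothetical):
#   init_model = copy.deepcopy(model)  # frozen copy of pretrained weights
#   for x, y in loader:
#       fine_tune_step(model, init_model, x, y, optimizer)
```

The distance penalty keeps the fine-tuned weights close to the initialization, which is the quantity that the Hessian distance-based bounds in the abstract depend on; the symmetric loss term is one example of the "consistent losses" that tolerate class-conditional label noise.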
