Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice

As a dominant paradigm, fine-tuning a pre-trained model on the target data is widely used in many deep learning applications, especially for small data sets. However, recent studies have empirically shown that, on some vision tasks, training from scratch achieves final performance no worse than this pre-training strategy once the number of training iterations is increased. In this work, we revisit this phenomenon from the perspective of generalization analysis, a standard tool in learning theory. Our result reveals that the final prediction precision may depend only weakly on the pre-trained model, especially when the number of training iterations is large. This observation inspires us to leverage the pre-training data during fine-tuning, since this data is often available alongside the pre-trained model. Our generalization analysis further shows that the final performance on a target task can be improved when an appropriate subset of the pre-training data is included in fine-tuning. Guided by this theoretical finding, we propose a novel strategy for selecting a subset of the pre-training data to improve generalization on the target task. Extensive experiments on 8 benchmark image classification data sets verify the effectiveness of the proposed data-selection-based fine-tuning pipeline.
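To make the selection-based pipeline concrete, the sketch below shows one plausible instantiation of the selection step, not the paper's exact algorithm: score each pre-training class by the distance of its mean feature (under the frozen pre-trained backbone) to the target classes, keep the closest classes, and fine-tune on the union of the target data and this subset. The names `backbone`, `target_loader`, `pretrain_loader`, and `k`, as well as the cosine-distance criterion, are assumptions made for illustration only.

```python
# Minimal sketch of a pre-training-data selection step (illustrative assumption,
# not the paper's exact criterion): keep the k pre-training classes whose mean
# features lie closest to any target-class mean feature.
import torch
import torch.nn.functional as F


@torch.no_grad()
def class_means(backbone, loader, device="cpu"):
    """Average feature vector per class label, computed with the frozen backbone."""
    sums, counts = {}, {}
    backbone.eval().to(device)
    for images, labels in loader:
        feats = backbone(images.to(device))            # (batch, feat_dim)
        for f, y in zip(feats, labels.tolist()):
            sums[y] = sums.get(y, 0) + f
            counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def select_pretrain_classes(backbone, target_loader, pretrain_loader, k=50):
    """Score each pre-training class by its minimum cosine distance to the
    target-class means; return the k closest class labels."""
    target_means = torch.stack(list(class_means(backbone, target_loader).values()))
    pretrain_means = class_means(backbone, pretrain_loader)
    scores = {
        y: (1 - F.cosine_similarity(m.unsqueeze(0), target_means)).min().item()
        for y, m in pretrain_means.items()
    }
    return sorted(scores, key=scores.get)[:k]
```

Fine-tuning would then proceed on the target data mixed with the samples from the selected pre-training classes; other proximity measures (e.g., optimal-transport distances between data sets) could replace the cosine criterion in this sketch.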
