Big Transfer (BiT): General Visual Representation Learning

Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19-task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct a detailed analysis of the main components that lead to high transfer performance.
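
For concreteness, the pre-train-then-fine-tune workflow the abstract describes can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration only, not the paper's BiT recipe: it assumes a torchvision ResNet-50 pre-trained on ImageNet as a stand-in for a BiT backbone, and the learning rate, momentum, and zero-initialized head are illustrative choices.

```python
# Minimal sketch of "pre-train on a large supervised dataset, then fine-tune".
# NOT the authors' exact BiT-HyperRule; backbone and hyperparameters are stand-ins.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10  # e.g. CIFAR-10 as the target task

# Load a model pre-trained on a large supervised dataset (here: ImageNet-1k).
model = models.resnet50(weights="IMAGENET1K_V1")

# Swap in a fresh, zero-initialized classification head for the target task.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
nn.init.zeros_(model.fc.weight)
nn.init.zeros_(model.fc.bias)

# Fine-tune the whole network with plain SGD + momentum.
optimizer = torch.optim.SGD(model.parameters(), lr=3e-3, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

def fine_tune_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One fine-tuning step on a batch from the target dataset."""
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the paper, the "simple heuristic" (BiT-HyperRule) additionally selects the training schedule length, image resolution, and whether to use MixUp from the size of the target dataset; those details are omitted from this sketch.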
