Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images

Transfer learning aims to exploit pre-trained models for more efficient follow-up training on wide range of downstream tasks and datasets, enabling successful training also on small data. Recent line of work posits strong benefits for model generalization and transfer when model size, data size, and compute budget are increased for the pre-training. It remains however still largely unclear whether the observed transfer improvement due to increase in scale also holds when source and target data distributions are far apart from each other. In this work we conduct largescale pre-training on large source datasets of either natural (ImageNet-21k/1k) or medical chest X-Ray images and compare full and few-shot transfer using different target datasets from both natural and medical imaging domains1. Our observations provide evidence that while pre-training and transfer on closely related datasets do show clear benefit of increasing model and data size during pre-training, such benefits are not clearly visible when source and target datasets are further apart. These observations hold across both full and few-shot transfer and indicate that scaling laws pointing to improvement of generalization and transfer with increasing model and data size are incomplete and should be revised by taking into account the type and proximity of the source and target data, to correctly predict the effect of model and data scale during pre-training on transfer. Remarkably, in full shot transfer to a large X-Ray chest imaging target (PadChest), the largest model pretrained on ImageNet-21k slightly outperforms best models pre-trained on large X-Ray chest imaging data. This indicates possibility to obtain high quality models for domain-specific transfer even without access to large domain-specific data, by pre-training instead on comparably very large, generic source data.

[1]  Lihi Zelnik-Manor,et al.  ImageNet-21K Pretraining for the Masses , 2021, NeurIPS Datasets and Benchmarks.

[2]  Quoc V. Le,et al.  Self-Training With Noisy Student Improves ImageNet Classification , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[4]  Lucas Beyer,et al.  Big Transfer (BiT): General Visual Representation Learning , 2020, ECCV.

[5]  David A. Shamma,et al.  YFCC100M , 2015, Commun. ACM.

[6]  Xiaohua Zhai,et al.  A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark , 2019 .

[7]  Mark Chen,et al.  Language Models are Few-Shot Learners , 2020, NeurIPS.

[8]  Ronald M. Summers,et al.  ChestX-ray: Hospital-Scale Chest X-ray Database and Benchmarks on Weakly Supervised Classification and Localization of Common Thorax Diseases , 2019, Deep Learning and Convolutional Neural Networks for Medical Imaging and Clinical Informatics.

[9]  Andrew Zisserman,et al.  Automated Flower Classification over a Large Number of Classes , 2008, 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing.

[10]  Jon Kleinberg,et al.  Transfusion: Understanding Transfer Learning for Medical Imaging , 2019, NeurIPS.

[11]  K. Simonyan,et al.  High-Performance Large-Scale Image Recognition Without Normalization , 2021, ICML.

[12]  Xiang Zhang,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.

[13]  Neil Houlsby,et al.  Supervised Transfer Learning at Scale for Medical Imaging , 2021, ArXiv.

[14]  Frank Hutter,et al.  SGDR: Stochastic Gradient Descent with Warm Restarts , 2016, ICLR.

[15]  Mikhail Belkin,et al.  Reconciling modern machine-learning practice and the classical bias–variance trade-off , 2018, Proceedings of the National Academy of Sciences.

[16]  Levent Sagun,et al.  Triple descent and the two kinds of overfitting: where and why do they appear? , 2020, NeurIPS.

[17]  Geoffrey E. Hinton,et al.  Big Self-Supervised Models are Strong Semi-Supervised Learners , 2020, NeurIPS.

[18]  Thomas Mensink,et al.  Factors of Influence for Transfer Learning Across Diverse Appearance Domains and Task Types , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Atsuto Maki,et al.  Factors of Transferability for a Generic ConvNet Representation , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[21]  Alexander Wong,et al.  COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images , 2020, Scientific reports.

[22]  Boaz Barak,et al.  Deep double descent: where bigger models and more data hurt , 2019, ICLR.

[23]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[24]  Yifan Yu,et al.  CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison , 2019, AAAI.

[25]  Chen Sun,et al.  Revisiting Unreasonable Effectiveness of Data in Deep Learning Era , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[26]  C. V. Jawahar,et al.  Cats and dogs , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Quoc V. Le,et al.  Rethinking Pre-training and Self-training , 2020, NeurIPS.

[28]  Tom Henighan,et al.  Scaling Laws for Transfer , 2021, ArXiv.

[29]  Stefan Jaeger,et al.  Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. , 2014, Quantitative imaging in medicine and surgery.

[30]  Steven Horng,et al.  MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports , 2019, Scientific Data.

[31]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[32]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[33]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Antonio Pertusa,et al.  PadChest: A large chest x-ray image dataset with multi-label annotated reports , 2019, Medical Image Anal..

[35]  Qiang Yang,et al.  A Survey on Transfer Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[36]  Quoc V. Le,et al.  Randaugment: Practical automated data augmentation with a reduced search space , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[37]  Dawn Song,et al.  Measuring Mathematical Problem Solving With the MATH Dataset , 2021, NeurIPS Datasets and Benchmarks.

[38]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[39]  Alec Radford,et al.  Scaling Laws for Neural Language Models , 2020, ArXiv.

[40]  Jascha Sohl-Dickstein,et al.  Measuring the Effects of Data Parallelism on Neural Network Training , 2018, J. Mach. Learn. Res..

[41]  A. Ng,et al.  CheXtransfer: performance and parameter efficiency of ImageNet models for chest X-Ray interpretation , 2021, CHIL.

[42]  Hugo Larochelle,et al.  Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples , 2019, ICLR.

[43]  Mark Chen,et al.  Scaling Laws for Autoregressive Generative Modeling , 2020, ArXiv.

[44]  Kaiming He,et al.  Exploring the Limits of Weakly Supervised Pretraining , 2018, ECCV.

[45]  Joseph Paul Cohen,et al.  TorchXRayVision: A library of chest X-ray datasets and models , 2021, MIDL.

[46]  Alexander Sergeev,et al.  Horovod: fast and easy distributed deep learning in TensorFlow , 2018, ArXiv.

[47]  Kaiming He,et al.  Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.