ImageNet-21K Pretraining for the Masses

ImageNet-1K serves as the primary dataset for pretraining deep learning models for computer vision tasks. The ImageNet-21K dataset, which contains more images and classes, is used less frequently for pretraining, mainly due to its complexity and an underestimation of its added value compared to standard ImageNet-1K pretraining. This paper aims to close this gap and make high-quality, efficient pretraining on ImageNet-21K available to everyone. Via a dedicated preprocessing stage that utilizes WordNet hierarchies, and a novel training scheme called semantic softmax, we show that various models, including small mobile-oriented models, significantly benefit from ImageNet-21K pretraining on numerous datasets and tasks. We also show that we outperform previous ImageNet-21K pretraining schemes for prominent new models such as ViT. Our proposed pretraining pipeline is efficient, accessible, and leads to SoTA reproducible results from a publicly available dataset. The training code and pretrained models are available at: https://github.com/Alibaba-
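
The abstract does not detail how the semantic softmax works, so the following is only a hedged sketch. It assumes that the ImageNet-21K classes are partitioned into disjoint semantic groups derived from the WordNet hierarchy, and that a separate softmax cross-entropy is computed over each group, with only the group containing a sample's ground-truth label contributing to that sample's loss. The function name, the `group_slices` and `class_to_group` structures, and the averaging over groups are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of a hierarchy-aware "semantic softmax" loss (PyTorch).
# Classes are assumed to be pre-partitioned into disjoint semantic groups
# (e.g., by WordNet hierarchy level); a softmax + cross-entropy is computed
# per group, and each sample only contributes to the group holding its label.
import torch
import torch.nn.functional as F

def semantic_softmax_loss(logits, target, group_slices, class_to_group):
    """
    logits:         (batch, num_classes) raw model outputs
    target:         (batch,) ground-truth class indices
    group_slices:   list of (start, end) column ranges, one per semantic group
    class_to_group: (num_classes,) tensor mapping each class to its group id
    """
    losses = []
    target_groups = class_to_group[target]          # group id of each sample's label
    for g, (start, end) in enumerate(group_slices):
        mask = target_groups == g                   # samples whose label lies in group g
        if not mask.any():
            continue
        group_logits = logits[mask, start:end]      # softmax restricted to this group's classes
        group_target = target[mask] - start         # re-index labels inside the group
        losses.append(F.cross_entropy(group_logits, group_target))
    return torch.stack(losses).mean()
```

One plausible motivation for such a scheme is that restricting each softmax to a single semantic group avoids penalizing the model for "confusions" between classes that sit at different WordNet hierarchy levels (e.g., a class and its own hypernym), which is a common source of label ambiguity in ImageNet-21K.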
