ImageNet-21K Pretraining for the Masses

ImageNet-1K serves as the primary dataset for pretraining deep learning models for computer vision tasks. The ImageNet-21K dataset, which contains more images and classes, is used less frequently for pretraining, mainly due to its complexity and an underestimation of its added value compared to standard ImageNet-1K pretraining. This paper aims to close this gap and make high-quality, efficient pretraining on ImageNet-21K available to everyone. Via a dedicated preprocessing stage that utilizes WordNet hierarchies, and a novel training scheme called semantic softmax, we show that various models, including small mobile-oriented models, significantly benefit from ImageNet-21K pretraining on numerous datasets and tasks. We also show that we outperform previous ImageNet-21K pretraining schemes for prominent new models such as ViT. Our proposed pretraining pipeline is efficient, accessible, and leads to SoTA reproducible results from a publicly available dataset. The training code and pretrained models are available at: https://github.com/Alibaba-
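
The abstract does not detail how the semantic softmax works, so the following is only a hedged sketch. It assumes that the ImageNet-21K classes are partitioned into disjoint semantic groups derived from the WordNet hierarchy, and that a separate softmax cross-entropy is computed over each group, with only the group containing a sample's ground-truth label contributing to that sample's loss. The function name, the `group_slices` and `class_to_group` structures, and the averaging over groups are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of a hierarchy-aware "semantic softmax" loss (PyTorch).
# Classes are assumed to be pre-partitioned into disjoint semantic groups
# (e.g., by WordNet hierarchy level); a softmax + cross-entropy is computed
# per group, and each sample only contributes to the group holding its label.
import torch
import torch.nn.functional as F

def semantic_softmax_loss(logits, target, group_slices, class_to_group):
    """
    logits:         (batch, num_classes) raw model outputs
    target:         (batch,) ground-truth class indices
    group_slices:   list of (start, end) column ranges, one per semantic group
    class_to_group: (num_classes,) tensor mapping each class to its group id
    """
    losses = []
    target_groups = class_to_group[target]          # group id of each sample's label
    for g, (start, end) in enumerate(group_slices):
        mask = target_groups == g                   # samples whose label lies in group g
        if not mask.any():
            continue
        group_logits = logits[mask, start:end]      # softmax restricted to this group's classes
        group_target = target[mask] - start         # re-index labels inside the group
        losses.append(F.cross_entropy(group_logits, group_target))
    return torch.stack(losses).mean()
```

One plausible motivation for such a scheme is that restricting each softmax to a single semantic group avoids penalizing the model for "confusions" between classes that sit at different WordNet hierarchy levels (e.g., a class and its own hypernym), which is a common source of label ambiguity in ImageNet-21K.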
