ConvNets vs. Transformers: Whose Visual Representations are More Transferable?

Vision transformers have attracted much attention from computer vision researchers because they are not restricted to the spatial inductive biases of ConvNets. However, although Transformer-based backbones have made significant progress on ImageNet classification, it remains unclear whether their learned representations are as transferable as, or even more transferable than, ConvNet features. To address this question, we systematically investigate the transfer learning ability of ConvNets and vision transformers across 15 single-task and multi-task evaluations. We observe consistent advantages for Transformer-based backbones on 13 of the 15 downstream tasks, including fine-grained classification, scene recognition (classification, segmentation, and depth estimation), open-domain classification, and face recognition. More specifically, we find that the two ViT models rely heavily on whole-network fine-tuning to achieve performance gains, whereas Swin Transformer has no such requirement. Moreover, vision transformers behave more robustly in multi-task learning: they bring larger improvements when handling mutually beneficial tasks and incur smaller performance losses when tackling unrelated tasks. We hope our findings can facilitate the future exploration and exploitation of vision transformers.
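To make the two transfer settings concrete, here is a minimal PyTorch sketch contrasting frozen-feature transfer (training only a new classification head) with whole-network fine-tuning. This is an illustration under stated assumptions, not the paper's exact protocol: the timm library, the specific pretrained model names, and the 200-class head (e.g., a fine-grained birds dataset) are all assumptions introduced for the example.

    import torch
    import timm

    def build_transfer_model(name: str, num_classes: int, finetune_whole: bool):
        # Load an ImageNet-pretrained backbone with a freshly initialized
        # classification head sized for the downstream task.
        model = timm.create_model(name, pretrained=True, num_classes=num_classes)
        if not finetune_whole:
            # Frozen-feature transfer: freeze every backbone parameter,
            # then re-enable gradients only for the new head.
            for p in model.parameters():
                p.requires_grad = False
            for p in model.get_classifier().parameters():
                p.requires_grad = True
        return model

    # Whole-network fine-tuning: the setting the ViT models depend on.
    vit = build_transfer_model('vit_base_patch16_224', 200, finetune_whole=True)

    # Frozen backbone plus a linear head: the setting where Swin remains competitive.
    swin = build_transfer_model('swin_base_patch4_window7_224', 200, finetune_whole=False)

    # Optimize only the trainable parameters, using AdamW (decoupled weight decay).
    optimizer = torch.optim.AdamW(
        (p for p in swin.parameters() if p.requires_grad),
        lr=1e-3, weight_decay=0.05)

In the frozen setting only the linear head's parameters reach the optimizer, so the backbone representation is evaluated as-is; in the fine-tuning setting every parameter is updated, which is the regime the abstract identifies as necessary for the ViT models but not for Swin Transformer.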
