Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Transformers, which are popular for language modeling, have recently been explored for vision tasks, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a fixed-length sequence of tokens and then applies multiple Transformer layers to model their global relations for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset (e.g., ImageNet). We find this is because: 1) the simple tokenization of input images fails to model the important local structure (e.g., edges, lines) among neighboring pixels, leading to low training sample efficiency; 2) the redundant attention backbone design of ViT leads to limited feature richness under fixed computation budgets and limited training samples. To overcome such limitations, we propose a new Tokens-to-Token Vision Transformer (T2T-ViT), which introduces 1) a layer-wise Tokens-to-Token (T2T) transformation that progressively structurizes the image into tokens by recursively aggregating neighboring tokens into one token, so that the local structure represented by surrounding tokens can be modeled and the token length can be reduced; 2) an efficient backbone with a deep-narrow structure for vision transformers, motivated by an extensive study of CNN architecture designs. Notably, T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 2.5% improvement when trained from scratch on ImageNet. It also outperforms ResNets and achieves performance comparable to MobileNets when trained directly on ImageNet. For example, a T2T-ViT of size comparable to ResNet50 achieves 80.7% accuracy on ImageNet.
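
To make the Tokens-to-Token idea concrete, below is a minimal PyTorch sketch of one T2T step, not the authors' released module: tokens are reshaped back to their 2D grid, neighboring tokens are aggregated with an overlapping unfold (a "soft split"), and the result is re-projected as a shorter sequence of richer tokens. The class name T2TBlock, the kernel/stride/padding values, and the linear re-projection are illustrative assumptions.

import torch
import torch.nn as nn

class T2TBlock(nn.Module):
    """Minimal sketch of one Tokens-to-Token step (illustrative, not
    the paper's exact module): aggregate each token's neighborhood so
    local structure is modeled and the token sequence gets shorter."""

    def __init__(self, dim_in, dim_out, kernel=3, stride=2, padding=1):
        super().__init__()
        # Overlapping soft split: each output token gathers a k x k
        # neighborhood of input tokens (stride < kernel => overlap).
        self.unfold = nn.Unfold(kernel_size=kernel, stride=stride, padding=padding)
        self.project = nn.Linear(dim_in * kernel * kernel, dim_out)

    def forward(self, tokens, h, w):
        # tokens: (B, N, C) with N == h * w
        b, n, c = tokens.shape
        grid = tokens.transpose(1, 2).reshape(b, c, h, w)  # back to a 2D layout
        patches = self.unfold(grid)                        # (B, C*k*k, N_out)
        patches = patches.transpose(1, 2)                  # (B, N_out, C*k*k)
        return self.project(patches)                       # fewer, richer tokens

# Usage: a 56x56 grid of 64-d tokens becomes a 28x28 grid of 128-d tokens.
x = torch.randn(2, 56 * 56, 64)
out = T2TBlock(64, 128)(x, 56, 56)
print(out.shape)  # torch.Size([2, 784, 128])

With stride 2, each such step roughly halves the spatial resolution, so after a few T2T steps the token sequence is short enough to feed the deep-narrow Transformer backbone.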
