Vision Transformer for Small-Size Datasets

Recently, the Vision Transformer (ViT), which applies the transformer architecture to image classification, has outperformed convolutional neural networks. However, the high performance of ViT results from pre-training on a large-scale dataset such as JFT-300M, and this dependence on large data is attributed to the ViT's low locality inductive bias. This paper proposes Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), which effectively address the lack of locality inductive bias and enable ViTs to be trained from scratch even on small-size datasets. Moreover, SPT and LSA are generic and effective add-on modules that are readily applicable to various ViTs. Experimental results show that when both SPT and LSA were applied, performance improved by an average of 2.96% on Tiny-ImageNet, a representative small-size dataset. In particular, the Swin Transformer achieved a substantial performance improvement of 4.08% thanks to the proposed SPT and LSA.
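The abstract names the two modules but not their internals. Below is a minimal PyTorch sketch of how Shifted Patch Tokenization and Locality Self-Attention could be realized; the diagonal shift pattern, the circular shift via torch.roll, the learnable softmax temperature, and the diagonal attention masking are illustrative assumptions, not the authors' reference implementation.

import torch
import torch.nn as nn


class ShiftedPatchTokenization(nn.Module):
    """Tokenize the image together with four diagonally shifted copies,
    so each patch token also covers its spatial neighborhood."""

    def __init__(self, in_chans=3, embed_dim=192, patch_size=8):
        super().__init__()
        self.shift = patch_size // 2  # assumed shift amount (illustrative)
        # 1 original + 4 shifted copies are stacked along the channel axis.
        self.proj = nn.Conv2d(in_chans * 5, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                              # x: (B, C, H, W)
        s = self.shift
        shifts = [(-s, -s), (-s, s), (s, -s), (s, s)]  # four diagonal directions
        # Circular shift is used here for brevity; padding/cropping is an alternative.
        shifted = [torch.roll(x, sh, dims=(2, 3)) for sh in shifts]
        x = torch.cat([x] + shifted, dim=1)            # (B, 5C, H, W)
        x = self.proj(x)                               # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)            # (B, N, D) patch tokens


class LocalitySelfAttention(nn.Module):
    """Self-attention with a learnable softmax temperature and a masked
    diagonal, sharpening attention toward relations between other tokens."""

    def __init__(self, dim, num_heads=3):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.out = nn.Linear(dim, dim)
        # Learnable temperature, initialized to the usual 1/sqrt(d_head).
        self.temperature = nn.Parameter(torch.tensor(head_dim ** -0.5))

    def forward(self, x):                              # x: (B, N, D)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, D // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        # Suppress each token's similarity with itself (diagonal masking).
        diag = torch.eye(N, dtype=torch.bool, device=x.device)
        attn = attn.masked_fill(diag, float('-inf')).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.out(out)

In a ViT-style model, modules of this kind would stand in for the standard patch-embedding layer and the scaled dot-product attention, respectively, leaving the rest of the architecture unchanged.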
