Rethinking Query, Key, and Value Embedding in Vision Transformer under Tiny Model Constraints

The vision transformer (ViT) has become a dominant model in computer vision. Although numerous studies focus mainly on inductive bias and computational complexity, the question of how to design better transformer networks remains open. For example, conventional transformer-based models typically use a single projection layer for each of the query (Q), key (K), and value (V) embeddings before multi-head self-attention, and this insufficient consideration of the semantics of the Q, K, and V embeddings may lead to a performance drop. In this paper, we propose three structures for Q, K, and V embedding. The first applies a two-layer non-linear embedding with ReLU separately to Q, K, and V. The second shares one of the non-linear layers so that knowledge is shared among Q, K, and V. The third shares all non-linear layers and uses trainable code parameters whose values determine whether the shared layers produce the Q, K, or V embedding. Experiments demonstrate that the proposed approaches achieve superior image classification performance compared to several state-of-the-art approaches. With only 3.1M parameters, the proposed method achieved 71.4% accuracy on the ImageNet-1k dataset, compared to 69.9% for the original XCiT-N12 transformer. In addition, with only 2.9M parameters, it achieved an average accuracy of 93.3% in transfer learning on the CIFAR-10, CIFAR-100, Stanford Cars, and STL-10 datasets, compared to 92.2% for the original XCiT-N12 model.
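
To make the three embedding variants concrete, the sketch below gives a minimal PyTorch rendering of each: per-role two-layer non-linear embeddings, a shared first non-linear layer, and fully shared layers modulated by trainable per-role codes. The class names, layer widths, placement of ReLU, and the interpretation of the codes as learned channel-wise gates are assumptions for illustration, not the paper's released implementation.

```python
# Minimal, illustrative sketch of the three Q/K/V embedding structures.
# All names and design details below are hypothetical assumptions for this example.
import torch
import torch.nn as nn


class SeparateNonLinearQKV(nn.Module):
    """Variant 1: each of Q, K, V has its own two-layer (Linear-ReLU-Linear) embedding."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.k = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.v = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.q(x), self.k(x), self.v(x)


class PartiallySharedQKV(nn.Module):
    """Variant 2: the first non-linear layer is shared; only the second layer is role-specific."""
    def __init__(self, dim):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):
        h = self.shared(x)
        return self.q(h), self.k(h), self.v(h)


class FullySharedCodedQKV(nn.Module):
    """Variant 3: all non-linear layers are shared; trainable per-role codes decide
    which embedding (Q, K, or V) the shared layers produce. Here the codes are
    modelled as learned channel-wise gates, which is one plausible reading."""
    def __init__(self, dim):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.codes = nn.Parameter(torch.ones(3, dim))  # one trainable code per role

    def forward(self, x):
        h = self.shared(x)
        q, k, v = (h * self.codes[i] for i in range(3))
        return q, k, v


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)  # (batch, tokens, embedding dim)
    for module in (SeparateNonLinearQKV(64), PartiallySharedQKV(64), FullySharedCodedQKV(64)):
        q, k, v = module(x)
        print(type(module).__name__, q.shape, k.shape, v.shape)
```

In all three cases the outputs feed the standard multi-head self-attention; the variants differ only in how many embedding parameters are shared across the Q, K, and V roles, which is what allows the third structure to stay within the tiny-model parameter budget reported above.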
