SPViT: Enabling Faster Vision Transformers via Soft Token Pruning

Recently, Vision Transformers (ViTs) have continuously set new milestones in the computer vision field, yet their high computation and memory costs hinder their adoption in industrial production. Pruning, a traditional model compression paradigm for hardware efficiency, has been widely applied to various DNN structures; nevertheless, it remains unclear how to perform pruning tailored specifically to the ViT structure. Considering three key points (the structural characteristics of ViTs, their internal data patterns, and deployment on edge devices), we leverage input token sparsity and propose a computation-aware soft pruning framework that can be set up on vanilla Transformers of both flattened and CNN-type structures, such as the Pooling-based ViT (PiT). More concretely, we design a dynamic, attention-based multi-head token selector, a lightweight module for adaptive, instance-wise token selection. We further introduce a soft pruning technique that integrates the less informative tokens identified by the selector module into a package token, which participates in subsequent calculations rather than being discarded entirely. Our framework balances accuracy against the computation constraints of specific edge devices through our proposed computation-aware training strategy. Experimental results show that our framework significantly reduces the computation cost of ViTs while maintaining comparable performance on image classification. Moreover, our framework guarantees that the identified model meets the resource specifications of mobile devices and FPGAs, and it even achieves real-time execution of DeiT-T on mobile platforms. For example, our method reduces the latency of DeiT-T to 26 ms (26%∼41% better than existing works) on a mobile device, with 0.25%∼4% higher top-1 accuracy on ImageNet. Our code will be released soon.
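To make the two core ideas concrete, the following is a minimal PyTorch sketch of an instance-wise multi-head token selector and of soft pruning into a single package token. It is an illustration under stated assumptions, not the authors' released implementation: the names (`TokenSelector`, `soft_prune`), the Gumbel-softmax relaxation of the keep/prune decision, and the weighted averaging into the package token are all choices made for the sketch.

```python
# Illustrative sketch only: module names and design details are assumptions,
# not the SPViT reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenSelector(nn.Module):
    """Lightweight multi-head scorer that predicts, per input instance, a
    keep/prune probability for every patch token."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Linear(dim, dim // num_heads),
                nn.GELU(),
                nn.Linear(dim // num_heads, 2),
            )
            for _ in range(num_heads)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim); average the per-head keep/prune logits.
        logits = torch.stack([h(tokens) for h in self.heads]).mean(dim=0)
        # Gumbel-softmax gives a differentiable, near-binary decision during
        # training; at inference a hard argmax would be used instead.
        decision = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)
        return decision[..., 0]  # (B, N): 1 = keep, 0 = prune


def soft_prune(tokens: torch.Tensor, keep: torch.Tensor) -> torch.Tensor:
    """Fold tokens marked for pruning into one 'package' token instead of
    discarding them, so later blocks keep a summary of their content."""
    keep = keep.unsqueeze(-1)                                     # (B, N, 1)
    pruned = 1.0 - keep
    denom = pruned.sum(dim=1, keepdim=True).clamp(min=1e-6)       # (B, 1, 1)
    package = (tokens * pruned).sum(dim=1, keepdim=True) / denom  # (B, 1, dim)
    # Zero out pruned positions; in a full model an attention mask would hide
    # them, while the package token carries their aggregated information.
    return torch.cat([tokens * keep, package], dim=1)  # (B, N + 1, dim)


if __name__ == "__main__":
    B, N, D = 2, 196, 192          # e.g. DeiT-T patch tokens
    x = torch.randn(B, N, D)
    selector = TokenSelector(D)
    out = soft_prune(x, selector(x))
    print(out.shape)               # torch.Size([2, 197, 192])
```

The uniform average over pruned tokens is the simplest possible aggregation; the key property the sketch preserves is that the package token takes part in all subsequent attention computations like any regular token, rather than the pruned information being lost outright.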
