Chasing Sparsity in Vision Transformers: An End-to-End Exploration

Vision transformers (ViTs) have recently received explosive popularity, but their enormous model sizes and training costs remain daunting. Conventional post-training pruning often incurs higher training budgets. In contrast, this paper aims to trim down both the training memory overhead and the inference complexity, without sacrificing the achievable accuracy. We launch and report the first-of-its-kind comprehensive exploration of a unified approach that integrates sparsity in ViTs “from end to end”. Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks while sticking to a fixed small parameter budget. Our approach jointly optimizes model parameters and explores connectivity throughout training, ending up with one sparse network as the final output. The approach extends seamlessly from unstructured to structured sparsity, the latter by guiding the prune-and-grow of self-attention heads inside ViTs. For additional efficiency gains, we further co-explore data and architecture sparsity by plugging in a novel learnable token selector that adaptively determines the currently most vital patches. Extensive results validate the effectiveness of our proposals on ImageNet with diverse ViT backbones. For instance, at 40% structured sparsity, our sparsified DeiT-Base achieves a 0.42% accuracy gain along with 33.13% FLOPs and 24.70% running-time savings, compared to its dense counterpart. Perhaps most surprisingly, we find that the proposed sparse (co-)training can even improve ViT accuracy rather than compromising it, making sparsity a tantalizing “free lunch”. For example, our sparsified DeiT-Small at (5%, 50%) sparsity for (data, architecture) improves top-1 accuracy by 0.28%, while enjoying 49.32% FLOPs and 4.40% running-time savings. Our code is available at https://github.com/VITA-Group/SViTE.
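To make the dynamic prune-and-grow idea concrete, below is a minimal PyTorch sketch of one mask update under a fixed parameter budget: the smallest-magnitude active weights are dropped and an equal number of inactive connections are regrown where the dense gradient is largest. This follows a RigL-style criterion for illustration only; the `prune_and_grow` function, the toy layer, the 50% sparsity level, and the update fraction are assumptions, not the paper's exact SViTE procedure (see the linked repository for that).

```python
import torch

@torch.no_grad()
def prune_and_grow(weight, mask, grad, update_fraction=0.1):
    """One dynamic sparse-training update at a fixed parameter budget:
    drop the smallest-magnitude active weights, then regrow the same number
    of inactive connections where the dense gradient magnitude is largest."""
    active = mask.bool()
    n_update = int(update_fraction * active.sum().item())
    if n_update == 0:
        return mask

    # Prune: deactivate the n_update active weights of smallest magnitude.
    prune_scores = weight.abs().masked_fill(~active, float("inf")).flatten()
    drop_idx = torch.topk(prune_scores, n_update, largest=False).indices
    new_mask = mask.clone().flatten()
    new_mask[drop_idx] = 0.0

    # Grow: activate the n_update inactive weights with the largest gradient
    # magnitude, so the number of active weights stays constant.
    grow_scores = grad.abs().flatten().masked_fill(new_mask.bool(), float("-inf"))
    grow_idx = torch.topk(grow_scores, n_update, largest=True).indices
    new_mask[grow_idx] = 1.0

    # Newly grown connections restart from zero.
    weight.view(-1)[grow_idx] = 0.0
    return new_mask.view_as(mask)


# Illustrative usage: periodically update the mask of a sparse linear layer
# during training and re-apply it so the layer keeps its target sparsity.
layer = torch.nn.Linear(384, 384)
mask = (torch.rand_like(layer.weight) < 0.5).float()  # ~50% unstructured sparsity
layer.weight.data.mul_(mask)

x = torch.randn(8, 384)
loss = (layer(x) ** 2).mean()
loss.backward()  # dense gradient used as the growth score

mask = prune_and_grow(layer.weight.data, mask, layer.weight.grad)
layer.weight.data.mul_(mask)  # re-apply the updated mask
```

In a full training loop this update would run only every few hundred iterations, with ordinary masked SGD/AdamW steps in between, so the connectivity is explored jointly with the weights rather than fixed up front.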

[1]  Jianfeng Gao,et al.  Unified Vision-Language Pre-Training for Image Captioning and VQA , 2020, AAAI.

[2]  Xin Wang,et al.  Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization , 2019, ICML.

[3]  Mattan Erez,et al.  PruneTrain: fast neural network training by dynamic sparse model reconfiguration , 2019, SC.

[4]  Song Han,et al.  Learning both Weights and Connections for Efficient Neural Network , 2015, NIPS.

[5]  Zhe Gan,et al.  Playing Lottery Tickets with Vision and Language , 2021, AAAI.

[6]  Shiyu Chang,et al.  TransGAN: Two Transformers Can Make One Strong GAN , 2021, ArXiv.

[7]  Mohit Bansal,et al.  LXMERT: Learning Cross-Modality Encoder Representations from Transformers , 2019, EMNLP.

[8]  Shiyu Chang,et al.  The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Song Han,et al.  EIE: Efficient Inference Engine on Compressed Deep Neural Network , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[10]  Chunhua Shen,et al.  End-to-End Video Instance Segmentation with Transformers , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Mykola Pechenizkiy,et al.  Sparse evolutionary deep learning with over one million artificial neurons on commodity hardware , 2019, Neural Computing and Applications.

[12]  Lu Yuan,et al.  Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding , 2021, ArXiv.

[13]  Michael Carbin,et al.  The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks , 2018, ICLR.

[14]  Enhua Wu,et al.  Transformer in Transformer , 2021, NeurIPS.

[15]  Hongyang Chao,et al.  Learning Joint Spatial-Temporal Transformations for Video Inpainting , 2020, ECCV.

[16]  Gintare Karolina Dziugaite,et al.  The Lottery Ticket Hypothesis at Scale , 2019, ArXiv.

[17]  Antonio Liotta,et al.  A topological insight into restricted Boltzmann machines , 2016, Machine Learning.

[18]  Cho-Jui Hsieh,et al.  VisualBERT: A Simple and Performant Baseline for Vision and Language , 2019, ArXiv.

[19]  Lukasz Kaiser,et al.  Reformer: The Efficient Transformer , 2020, ICLR.

[20]  Ling Shao,et al.  Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions , 2021, ArXiv.

[21]  Tianlong Chen,et al.  GANs Can Play Lottery Tickets Too , 2021, ICLR.

[22]  Yongqiang Lyu,et al.  SNrram: An Efficient Sparse Neural Network Computation Architecture Based on Resistive Random-Access Memory , 2018, 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC).

[23]  Aurko Roy,et al.  Efficient Content-Based Sparse Attention with Routing Transformers , 2021, TACL.

[24]  Arman Cohan,et al.  Longformer: The Long-Document Transformer , 2020, ArXiv.

[25]  Xiaogang Wang,et al.  End-to-End Object Detection with Adaptive Clustering Transformer , 2020, BMVC.

[26]  Stefan Lee,et al.  ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , 2019, NeurIPS.

[27]  Steve B. Furber,et al.  Memory-Efficient Deep Learning on a SpiNNaker 2 Prototype , 2018, Front. Neurosci..

[28]  Nikolaos Pappas,et al.  Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention , 2020, ICML.

[29]  Yi Tay,et al.  Efficient Transformers: A Survey , 2020, ArXiv.

[30]  Roger B. Grosse,et al.  Picking Winning Tickets Before Training by Preserving Gradient Flow , 2020, ICLR.

[31]  Erich Elsen,et al.  Rigging the Lottery: Making All Tickets Winners , 2020, ICML.

[32]  Yu Cheng,et al.  UNITER: UNiversal Image-TExt Representation Learning , 2019, ECCV.

[33]  Zhe Gan,et al.  EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets , 2020, ACL.

[34]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[35]  Tor M. Aamodt,et al.  Sparse Weight Activation Training , 2020, NeurIPS.

[36]  Zhangyang Wang,et al.  Efficient Lottery Ticket Finding: Less Data is More , 2021, ICML.

[37]  Yee Whye Teh,et al.  Set Transformer , 2018, ICML.

[38]  Baining Guo,et al.  Learning Texture Transformer Network for Image Super-Resolution , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  M. Ashby,et al.  Exploiting Unstructured Sparsity on Next-Generation Datacenter Hardware , 2019 .

[40]  Liu Yang,et al.  Long Range Arena: A Benchmark for Efficient Transformers , 2020, ICLR.

[41]  Yanzhi Wang,et al.  Reweighted Proximal Pruning for Large-Scale Language Representation , 2019, ArXiv.

[42]  Mark Chen,et al.  Generative Pretraining From Pixels , 2020, ICML.

[43]  Yin Yang,et al.  Compressing Large-Scale Transformer-Based Models: A Case Study on BERT , 2020, Transactions of the Association for Computational Linguistics.

[44]  Mykola Pechenizkiy,et al.  Selfish Sparse RNN Training , 2021, ICML.

[45]  Kai Han,et al.  Visual Transformer Pruning , 2021, ArXiv.

[46]  Vivienne Sze,et al.  Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices , 2018, IEEE Journal on Emerging and Selected Topics in Circuits and Systems.

[47]  Song Han,et al.  Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding , 2015, ICLR.

[48]  Furu Wei,et al.  VL-BERT: Pre-training of Generic Visual-Linguistic Representations , 2019, ICLR.

[49]  Erich Elsen,et al.  The State of Sparsity in Deep Neural Networks , 2019, ArXiv.

[50]  Gintare Karolina Dziugaite,et al.  Linear Mode Connectivity and the Lottery Ticket Hypothesis , 2019, ICML.

[51]  Yann LeCun,et al.  Optimal Brain Damage , 1989, NIPS.

[52]  Yue Wang,et al.  Drawing early-bird tickets: Towards more efficient training of deep networks , 2019, ICLR.

[53]  Mykola Pechenizkiy,et al.  Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training , 2021, ICML.

[54]  Zhiqiang Shen,et al.  Learning Efficient Convolutional Networks through Network Slimming , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[55]  Shuicheng Yan,et al.  Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet , 2021, ArXiv.

[56]  Haoyu Ma,et al.  Good Students Play Big Lottery Better , 2021, ArXiv.

[57]  Luowei Zhou,et al.  End-to-End Dense Video Captioning with Masked Transformer , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[58]  Dustin Tran,et al.  Image Transformer , 2018, ICML.

[59]  Kai Han,et al.  CMT: Convolutional Neural Networks Meet Vision Transformers , 2021, ArXiv.

[60]  Martin Jaggi,et al.  On the Relationship between Self-Attention and Convolutional Layers , 2019, ICLR.

[61]  Jack Xin,et al.  Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets , 2019, ICLR.

[62]  Hao Zhou,et al.  Less Is More: Towards Compact CNNs , 2016, ECCV.

[63]  Han Fang,et al.  Linformer: Self-Attention with Linear Complexity , 2020, ArXiv.

[64]  Pavlo Molchanov,et al.  Importance Estimation for Neural Network Pruning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[65]  Omer Levy,et al.  Are Sixteen Heads Really Better than One? , 2019, NeurIPS.

[66]  Erich Elsen,et al.  Fast Sparse ConvNets , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[67]  Junying Chen,et al.  UP-DETR: Unsupervised Pre-training for Object Detection with Transformers , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[68]  Xiangyu Zhang,et al.  Channel Pruning for Accelerating Very Deep Neural Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[69]  Luke Zettlemoyer,et al.  Sparse Networks from Scratch: Faster Training without Losing Performance , 2019, ArXiv.

[70]  Edouard Grave,et al.  Reducing Transformer Depth on Demand with Structured Dropout , 2019, ICLR.

[71]  Gordon Erlebacher,et al.  The Generalization-Stability Tradeoff in Neural Network Pruning , 2019, NeurIPS.

[72]  Xiaojie Jin,et al.  DeepViT: Towards Deeper Vision Transformer , 2021, ArXiv.

[73]  Zhe Gan,et al.  Ultra-Data-Efficient GAN Training: Drawing A Lottery Ticket First, Then Training It Toughly , 2021, ArXiv.

[74]  Suyog Gupta,et al.  To prune, or not to prune: exploring the efficacy of pruning for model compression , 2017, ICLR.

[75]  B. Brookes,et al.  Statistical Theory of Extreme Values and Some Practical Applications , 1955, The Mathematical Gazette.

[76]  MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers , 2020, ArXiv.

[77]  Jianfeng Gao,et al.  Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks , 2020, ECCV.

[78]  Matthieu Cord,et al.  Training data-efficient image transformers & distillation through attention , 2020, ICML.

[79]  Peter Stone,et al.  Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science , 2017, Nature Communications.

[80]  Bin Li,et al.  Deformable DETR: Deformable Transformers for End-to-End Object Detection , 2020, ICLR.

[81]  Dacheng Tao,et al.  Patch Slimming for Efficient Vision Transformers , 2021, ArXiv.

[82]  Aude Oliva,et al.  IA-RED2: Interpretability-Aware Redundancy Reduction for Vision Transformers , 2021, NeurIPS.

[83]  Georg Heigold,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2021, ICLR.

[84]  Nan Duan,et al.  Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training , 2019, AAAI.

[85]  Erich Elsen,et al.  The Difficulty of Training Sparse Neural Networks , 2019, ArXiv.

[86]  Avirup Sil,et al.  Structured Pruning of a BERT-based Question Answering Model , 2019 .

[87]  Timothy P. Lillicrap,et al.  Compressive Transformers for Long-Range Sequence Modelling , 2019, ICLR.

[88]  Philip H. S. Torr,et al.  SNIP: Single-shot Network Pruning based on Connection Sensitivity , 2018, ICLR.

[89]  Wen Gao,et al.  Pre-Trained Image Processing Transformer , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[90]  Razvan Pascanu,et al.  Top-KAST: Top-K Always Sparse Training , 2021, NeurIPS.

[91]  Tim Salimans,et al.  Axial Attention in Multidimensional Transformers , 2019, ArXiv.

[92]  Zhangyang Wang,et al.  A Unified Lottery Ticket Hypothesis for Graph Neural Networks , 2021, ICML.

[93]  Lukasz Kaiser,et al.  Rethinking Attention with Performers , 2020, ArXiv.

[94]  Nicolas Usunier,et al.  End-to-End Object Detection with Transformers , 2020, ECCV.

[95]  Tao Zhang,et al.  A Survey of Model Compression and Acceleration for Deep Neural Networks , 2017, ArXiv.

[96]  Shiyu Chang,et al.  The Lottery Ticket Hypothesis for Pre-trained BERT Networks , 2020, NeurIPS.