Multi-Dimensional Model Compression of Vision Transformer

Vision transformers (ViTs) have recently attracted considerable attention, but their high computational cost remains an obstacle to practical deployment. Previous ViT pruning methods tend to prune the model along a single dimension only, which can over-reduce that dimension and yield sub-optimal model quality. In contrast, we advocate a multi-dimensional ViT compression paradigm and propose to reduce redundancy jointly across the attention-head, neuron, and sequence dimensions. We first propose a pruning criterion based on statistical dependence that generalizes across these dimensions and identifies deleterious components. We then cast multi-dimensional compression as an optimization problem: learning the pruning policy across the three dimensions that maximizes the compressed model's accuracy under a computational budget. We solve this problem with an adapted Gaussian process search using expected improvement. Experimental results show that our method effectively reduces the computational cost of various ViT models. For example, it removes 40% of the FLOPs of DeiT and T2T-ViT models without top-1 accuracy loss, outperforming previous state-of-the-art methods.
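To make the dependence-based criterion concrete, below is a minimal sketch of scoring a component's outputs with the Hilbert-Schmidt Independence Criterion (HSIC), the kind of statistical-dependence measure the abstract refers to. This is an illustrative reading, not the paper's exact formulation; the function names and the choice of an RBF kernel over features with a linear kernel over one-hot labels are assumptions.

```python
# Sketch: rank prunable components (heads / neurons / tokens) by a biased
# HSIC estimate between their output features and the class labels, computed
# on a small calibration batch. Low dependence suggests a deleterious component.
import numpy as np

def rbf_kernel(x, sigma=None):
    """Gram matrix of an RBF kernel; sigma defaults to the median heuristic."""
    sq = np.sum(x**2, 1)[:, None] + np.sum(x**2, 1)[None, :] - 2.0 * x @ x.T
    if sigma is None:
        sigma = np.sqrt(np.median(sq[sq > 0]) / 2.0)
    return np.exp(-sq / (2.0 * sigma**2))

def hsic_score(features, labels_onehot):
    """Biased HSIC estimate: higher means stronger feature-label dependence."""
    n = features.shape[0]
    K = rbf_kernel(features)              # kernel over the component's outputs
    L = labels_onehot @ labels_onehot.T   # linear kernel over one-hot labels
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2
```

Because the score only needs a component's output features and the labels, the same measure can be evaluated for attention heads, MLP neurons, and sequence tokens alike, which is what makes a single criterion usable across all three dimensions.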

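The search over pruning policies can likewise be sketched. Below is a minimal, assumption-laden illustration of Gaussian process search with expected improvement over per-dimension pruning ratios under a FLOPs budget; `flops_of` and `finetune_eval_accuracy` are hypothetical stand-ins for the budget check and the accuracy evaluation of a pruned model, and the specific kernel and sampling scheme are not from the paper.

```python
# Sketch: Bayesian optimization of a pruning policy (head_ratio, neuron_ratio,
# token_ratio) that maximizes accuracy subject to a FLOPs budget.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(gp, candidates, best_y):
    """EI acquisition: expected gain over the best accuracy observed so far."""
    mu, std = gp.predict(candidates, return_std=True)
    std = np.maximum(std, 1e-9)
    z = (mu - best_y) / std
    return (mu - best_y) * norm.cdf(z) + std * norm.pdf(z)

def gp_search(budget_flops, n_init=10, n_iter=40, seed=0):
    rng = np.random.default_rng(seed)
    sample = lambda k: rng.uniform(0.0, 0.9, size=(k, 3))  # three pruning ratios
    # Seed the GP with budget-feasible random policies.
    X = [x for x in sample(5 * n_init) if flops_of(x) <= budget_flops][:n_init]
    y = [finetune_eval_accuracy(x) for x in X]              # hypothetical helper
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(np.array(X), np.array(y))
        cand = np.array([x for x in sample(1000) if flops_of(x) <= budget_flops])
        best = cand[np.argmax(expected_improvement(gp, cand, max(y)))]
        X.append(best)
        y.append(finetune_eval_accuracy(best))
    return X[int(np.argmax(y))]  # best budget-feasible pruning policy found
```

Restricting both the initial design and the candidate pool to budget-feasible policies keeps every GP evaluation inside the computational constraint, so the acquisition step only trades off exploration against predicted accuracy.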