Token Merging: Your ViT But Faster

We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without retraining. ToMe gradually combines similar tokens in a transformer using a general, lightweight matching algorithm that is as fast as pruning while being more accurate. Off-the-shelf, ToMe can double the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images, and can increase the throughput of ViT-L on video by 2.2x, with only a 0.2-0.3% accuracy drop in each case. ToMe can also be applied easily during training, in practice improving training speed by up to 2x for MAE fine-tuning on video. Training with ToMe further reduces the accuracy drop, enabling 2x the throughput of ViT-B on audio for only a 0.4% mAP drop. Qualitatively, we find that ToMe merges the parts of an object into a single token, even across multiple frames of video. Overall, ToMe's accuracy and speed are competitive with the state of the art on images, video, and audio.
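
To make the matching step concrete, below is a minimal PyTorch sketch of ToMe-style bipartite soft matching: tokens are split into two alternating sets, each token in one set is paired with its most similar counterpart in the other, and the r most similar pairs are merged. This is an illustrative sketch, not the authors' implementation; the paper matches on the attention keys, merges with a size-weighted average, protects the class token, and uses proportional attention, all of which are simplified away here (we match on the token features themselves and take a plain mean).

```python
import torch

def bipartite_soft_matching(x: torch.Tensor, r: int) -> torch.Tensor:
    """Simplified ToMe-style bipartite soft matching (illustrative only).

    x: token features of shape (batch, tokens, dim).
    r: number of tokens to remove at this layer.
    Returns tokens of shape (batch, tokens - r, dim).
    """
    b, n, d = x.shape
    r = min(r, n // 2)  # at most half the tokens can be merged per step

    # Alternately assign tokens to two sets: A (even indices), B (odd).
    a, b_set = x[:, ::2, :], x[:, 1::2, :]

    # Cosine similarity between every token in A and every token in B.
    a_norm = a / a.norm(dim=-1, keepdim=True)
    b_norm = b_set / b_set.norm(dim=-1, keepdim=True)
    scores = a_norm @ b_norm.transpose(-1, -2)  # (b, |A|, |B|)

    # Each A-token proposes one edge to its most similar B-token.
    best_val, best_idx = scores.max(dim=-1)  # both (b, |A|)

    # Keep only the r highest-similarity edges; those A-tokens are merged.
    order = best_val.argsort(dim=-1, descending=True)
    merged_a, kept_a = order[:, :r], order[:, r:]

    # Destination in B for each merged A-token.
    dst_idx = best_idx.gather(-1, merged_a)  # (b, r)

    # Average each merged A-token into its destination B-token
    # (plain mean here; the paper uses a size-weighted average).
    src = a.gather(1, merged_a.unsqueeze(-1).expand(-1, -1, d))
    out_b = b_set.clone()
    out_b.scatter_reduce_(1, dst_idx.unsqueeze(-1).expand(-1, -1, d),
                          src, reduce="mean", include_self=True)

    # Surviving A-tokens plus the (possibly merged-into) B-tokens.
    out_a = a.gather(1, kept_a.unsqueeze(-1).expand(-1, -1, d))
    return torch.cat([out_a, out_b], dim=1)

# Hypothetical usage: shrink a ViT-B/16 token sequence by r = 16 per layer.
tokens = torch.randn(2, 197, 768)
merged = bipartite_soft_matching(tokens, r=16)
print(merged.shape)  # torch.Size([2, 181, 768])
```

Because each A-token proposes exactly one edge and only the top-r edges are kept, this matching runs in a single pass with no iterative clustering, which is why merging can be as fast as pruning while retaining the information of the merged tokens.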
