Token Merging: Your ViT But Faster
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman