Vision Transformers are Parameter-Efficient Audio-Visual Learners