Vision Transformers are Parameter-Efficient Audio-Visual Learners

Vision transformers (ViTs) have achieved impressive results on various computer vision tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained only on visual data, to generalize to audio-visual data without finetuning any of their original parameters. To do so, we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks by injecting a small number of trainable parameters into every layer of a frozen ViT. To efficiently fuse visual and audio cues, our LAVISH adapter uses a small set of latent tokens, which form an attention bottleneck, thus eliminating the quadratic cost of standard cross-attention. Compared to existing modality-specific audio-visual methods, our approach achieves competitive or even better performance on various audio-visual tasks while using fewer tunable parameters and without relying on costly audio pretraining or external audio encoders. Our code is available at https://genjib.github.io/project_page/LAVISH/
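To make the attention-bottleneck idea concrete, below is a minimal, hedged PyTorch sketch of latent-token cross-modal fusion. It is not the authors' implementation; all module names, token counts, and dimensions (`LatentBottleneckFusion`, `num_latents=8`, `dim=768`) are illustrative assumptions. The key point it demonstrates: with K latent tokens, N target tokens, and M source tokens, attention costs O((N + M)·K) instead of the O(N·M) of full cross-attention.

```python
import torch
import torch.nn as nn

class LatentBottleneckFusion(nn.Module):
    """Illustrative sketch (not the official LAVISH code) of a latent-token
    attention bottleneck: a small set of K latent tokens first summarizes
    one modality, and the other modality then attends only to these latents."""

    def __init__(self, dim=768, num_latents=8, num_heads=8):
        super().__init__()
        # K learnable latent tokens (the bottleneck); K << N, M.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        # Latents gather information from the source modality (e.g., audio).
        self.compress = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Target-modality tokens (e.g., visual) read from the latents.
        self.expand = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, target_tokens, source_tokens):
        # target_tokens: (B, N, dim); source_tokens: (B, M, dim)
        B = target_tokens.size(0)
        latents = self.latents.unsqueeze(0).expand(B, -1, -1)  # (B, K, dim)
        # Step 1: latents cross-attend to the source modality, cost O(K*M).
        latents, _ = self.compress(latents, source_tokens, source_tokens)
        # Step 2: target tokens cross-attend to the few latents, cost O(N*K).
        fused, _ = self.expand(target_tokens, latents, latents)
        # Residual connection leaves the frozen backbone's features intact.
        return target_tokens + fused

# Usage sketch: fuse audio cues into the visual tokens of a frozen ViT layer.
fusion = LatentBottleneckFusion()
visual = torch.randn(2, 196, 768)  # e.g., 14x14 patch tokens
audio = torch.randn(2, 512, 768)   # e.g., spectrogram patch tokens
out = fusion(visual, audio)        # (2, 196, 768)
```

In a parameter-efficient setup of this kind, only the adapter's parameters (the latents and the two attention blocks) would be trained, while the pretrained ViT weights stay frozen.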
