Paper Information - MAViL: Masked Audio-Video Learners

We present Masked Audio-Video Learners (MAViL) to train audio-visual representations. Our approach learns with three complementary forms of self-supervision: (1) reconstruction of masked audio and video input data, (2) intra- and inter-modal contrastive learning with masking, and (3) self-training by reconstructing joint audio-video contextualized features learned from the first two objectives. Pre-training with MAViL not only enables the model to perform well in audio-visual classification and retrieval tasks but also improves representations of each modality in isolation, without using information from the other modality for fine-tuning or inference. Empirically, MAViL sets a new state-of-the-art on AudioSet (53.1 mAP) and VGGSound (67.1% accuracy). For the first time, a self-supervised audio-visual model outperforms ones that use external supervision on these benchmarks.
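To make the first two objectives concrete, here is a minimal NumPy sketch of a masked-reconstruction loss and an inter-modal contrastive (InfoNCE) loss on toy data. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 0.75 mask ratio, the temperature of 0.1, the tensor shapes, and the unweighted sum of the two losses are all hypothetical choices made for this example, and the encoder and decoder are replaced by random stand-ins.

```python
import numpy as np


def random_mask(num_tokens, mask_ratio, rng):
    """Return a boolean mask; True marks tokens hidden from the encoder."""
    num_masked = int(round(num_tokens * mask_ratio))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.permutation(num_tokens)[:num_masked]] = True
    return mask


def masked_mse(pred, target, mask):
    """Mean squared error computed only over the masked tokens."""
    diff = (pred - target)[mask]
    return float(np.mean(diff ** 2))


def info_nce(a, b, temperature=0.1):
    """InfoNCE loss: row i of `a` is the positive for row i of `b`;
    every other row of `b` acts as a negative."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


rng = np.random.default_rng(0)

# Objective (1): reconstruct masked input patches.
# Toy "audio" clip: 16 patches of dimension 8 (hypothetical sizes).
audio_patches = rng.normal(size=(16, 8))
mask = random_mask(num_tokens=16, mask_ratio=0.75, rng=rng)
decoder_output = rng.normal(size=audio_patches.shape)  # stand-in for a decoder
recon_loss = masked_mse(decoder_output, audio_patches, mask)

# Objective (2): inter-modal contrast between paired clip-level embeddings.
# Toy batch of 4 clips; the video embedding is a noisy copy of the audio one.
audio_emb = rng.normal(size=(4, 8))
video_emb = audio_emb + 0.01 * rng.normal(size=(4, 8))
contrast_loss = info_nce(audio_emb, video_emb)

# Assumed unweighted sum; any real weighting is not specified here.
total_loss = recon_loss + contrast_loss
```

Under the same assumptions, the third objective could reuse `masked_mse` with the regression target swapped from raw patches to contextualized features produced by a model trained with the first two objectives.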
