MAViL: Masked Audio-Video Learners
暂无分享,去创建一个
Chaitanya K. Ryali | J. Malik | Haoqi Fan | Christoph Feichtenhofer | Shang-Wen Li | Yanghao Li | Vasu Sharma | Hu Xu | Gargi Ghosh | Po-Yao (Bernie) Huang | C. Ryali
[1] Yusong Wu,et al. Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation , 2022, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[2] James R. Glass,et al. Contrastive Audio-Visual Masked Autoencoder , 2022, ICLR.
[3] Mohit Bansal,et al. TVLT: Textless Vision-Language Transformer , 2022, NeurIPS.
[4] Rongrong Ji,et al. Exploring Target Representations for Masked Autoencoders , 2022, ArXiv.
[5] Xiaohui Shen,et al. Contrastive Masked Autoencoders are Stronger Vision Learners , 2022, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[6] Michael Auli,et al. Masked Autoencoders that Listen , 2022, NeurIPS.
[7] Dong Chen,et al. Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation , 2022, ArXiv.
[8] Tara N. Sainath,et al. Self-Supervised Speech Representation Learning: A Review , 2022, IEEE Journal of Selected Topics in Signal Processing.
[9] Haoqi Fan,et al. Masked Autoencoders As Spatiotemporal Learners , 2022, NeurIPS.
[10] A. Zamir,et al. MultiMAE: Multi-modal Multi-task Masked Autoencoders , 2022, ECCV.
[11] David F. Harwath,et al. MAE-AST: Masked Autoencoding Audio Spectrogram Transformer , 2022, INTERSPEECH.
[12] Florian Metze,et al. AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification , 2022, INTERSPEECH.
[13] Lingxi Xie,et al. MVP: Multimodality-guided Visual Pre-training , 2022, ECCV.
[14] Michael Auli,et al. data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language , 2022, ICML.
[15] S. Dubnov,et al. HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection , 2022, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[16] Abdel-rahman Mohamed,et al. Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction , 2022, ICLR.
[17] A. Yuille,et al. Masked Feature Prediction for Self-Supervised Visual Pre-Training , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[18] J. Malik,et al. MViTv2: Improved Multiscale Vision Transformers for Classification and Detection , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[19] Ross B. Girshick,et al. Masked Autoencoders Are Scalable Vision Learners , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[20] James R. Glass,et al. SSAST: Self-Supervised Audio Spectrogram Transformer , 2021, AAAI.
[21] Jan Schlüter,et al. Efficient Training of Audio Transformers with Patchout , 2021, INTERSPEECH.
[22] Mark D. Plumbley,et al. Audio Captioning Transformer , 2021, DCASE.
[23] C. Schmid,et al. Attention Bottlenecks for Multimodal Fusion , 2021, NeurIPS.
[24] Li Dong,et al. BEiT: BERT Pre-Training of Image Transformers , 2021, ICLR.
[25] Aren Jansen,et al. The Benefit of Temporally-Strong Labels in Audio Event Classification , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[26] Zeynep Akata,et al. Audio Retrieval with Natural Language Queries , 2021, Interspeech.
[27] Julien Mairal,et al. Emerging Properties in Self-Supervised Vision Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[28] Grzegorz Chrupała. Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques , 2021, J. Artif. Intell. Res..
[29] Aäron van den Oord,et al. Multimodal Self-Supervised Learning of General Audio Representations , 2021, ArXiv.
[30] Christoph Feichtenhofer,et al. Multiscale Vision Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[31] James R. Glass,et al. AST: Audio Spectrogram Transformer , 2021, Interspeech.
[32] Saining Xie,et al. An Empirical Study of Training Self-Supervised Vision Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[33] Andrew Zisserman,et al. Broaden Your Views for Self-Supervised Video Learning , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[34] Nuno Vasconcelos,et al. Robust Audio-Visual Instance Discrimination , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[35] João F. Henriques,et al. Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[36] Dima Damen,et al. Slow-Fast Auditory Streams for Audio Recognition , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[37] Andrew Zisserman,et al. Perceiver: General Perception with Iterative Attention , 2021, ICML.
[38] Ilya Sutskever,et al. Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.
[39] Alec Radford,et al. Zero-Shot Text-to-Image Generation , 2021, ICML.
[40] James R. Glass,et al. AVLnet: Learning Audio-Visual Language Representations from Instructional Videos , 2020, Interspeech.
[41] Pierre H. Richemond,et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning , 2020, NeurIPS.
[42] Saniat Javid Sohrawardi,et al. Recurrent Convolutional Structures for Audio Spoof and Video Deepfake Detection , 2020, IEEE Journal of Selected Topics in Signal Processing.
[43] Anurag Kumar,et al. Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data , 2020, IJCAI.
[44] Mark Chen,et al. Language Models are Few-Shot Learners , 2020, NeurIPS.
[45] Andrew Zisserman,et al. Vggsound: A Large-Scale Audio-Visual Dataset , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[46] N. Vasconcelos,et al. Audio-Visual Instance Discrimination with Cross-Modal Agreement , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[47] Geoffrey Zweig,et al. On Compositions of Transformations in Contrastive Self-Supervised Learning , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[48] Geoffrey E. Hinton,et al. A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.
[49] Yong Jae Lee,et al. Audiovisual SlowFast Networks for Video Recognition , 2020, ArXiv.
[50] Mark D. Plumbley,et al. PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[51] Ross B. Girshick,et al. Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[52] Yonglong Tian,et al. Contrastive Representation Distillation , 2019, ICLR.
[53] Tuomas Virtanen,et al. Clotho: an Audio Captioning Dataset , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[54] Jang Hyun Cho,et al. On the Efficacy of Knowledge Distillation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[55] Dima Damen,et al. EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[56] Omer Levy,et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.
[57] Ivan Laptev,et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[58] Geoffrey E. Hinton,et al. When Does Label Smoothing Help? , 2019, NeurIPS.
[59] Gunhee Kim,et al. AudioCaps: Generating Captions for Audios in The Wild , 2019, NAACL.
[60] Du Tran,et al. What Makes Training Multi-Modal Classification Networks Hard? , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[61] Seong Joon Oh,et al. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[62] Quoc V. Le,et al. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition , 2019, INTERSPEECH.
[63] Yan Lu,et al. Relational Knowledge Distillation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[64] Jitendra Malik,et al. SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[65] Oriol Vinyals,et al. Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.
[66] Ross B. Girshick,et al. Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization , 2018, NeurIPS.
[67] Andrew Owens,et al. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features , 2018, ECCV.
[68] W. Freeman,et al. Looking to listen at the cocktail party , 2018, ACM Trans. Graph..
[69] Pete Warden,et al. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition , 2018, ArXiv.
[70] James R. Glass,et al. Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input , 2018, International Journal of Computer Vision.
[71] Andrew Zisserman,et al. Objects that Sound , 2017, ECCV.
[72] Asit K. Mishra,et al. Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy , 2017, ICLR.
[73] Frank Hutter,et al. Decoupled Weight Decay Regularization , 2017, ICLR.
[74] Graham W. Taylor,et al. Deep Multimodal Learning: A Survey on Recent Advances and Trends , 2017, IEEE Signal Processing Magazine.
[75] Hongyi Zhang,et al. mixup: Beyond Empirical Risk Minimization , 2017, ICLR.
[76] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[77] Andrew Zisserman,et al. Look, Listen and Learn , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[78] Richard Socher,et al. A Deep Reinforced Model for Abstractive Summarization , 2017, ICLR.
[79] Aren Jansen,et al. Audio Set: An ontology and human-labeled dataset for audio events , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[80] Chenliang Xu,et al. Towards Automatic Learning of Procedures From Web Instructional Videos , 2017, AAAI.
[81] Richard P. Wildes,et al. Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.
[82] Antonio Torralba,et al. SoundNet: Learning Sound Representations from Unlabeled Video , 2016, NIPS.
[83] Aren Jansen,et al. CNN architectures for large-scale audio classification , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[84] Frank Hutter,et al. SGDR: Stochastic Gradient Descent with Warm Restarts , 2016, ICLR.
[85] Gregory Shakhnarovich,et al. FractalNet: Ultra-Deep Neural Networks without Residuals , 2016, ICLR.
[86] Tao Mei,et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[87] Karol J. Piczak. ESC: Dataset for Environmental Sound Classification , 2015, ACM Multimedia.
[88] Geoffrey E. Hinton,et al. Distilling the Knowledge in a Neural Network , 2015, ArXiv.
[89] Honglak Lee,et al. Deep learning for robust feature generation in audiovisual emotion recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.
[90] Juhan Nam,et al. Multimodal Deep Learning , 2011, ICML.
[91] Aapo Hyvärinen,et al. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models , 2010, AISTATS.
[92] Aggelos K. Katsaggelos,et al. Audio-Visual Biometrics , 2006, Proceedings of the IEEE.
[93] Bill Triggs,et al. Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).
[94] Chalapathy Neti,et al. Recent advances in the automatic recognition of audiovisual speech , 2003, Proc. IEEE.
[95] Tsuhan Chen,et al. Audio-visual integration in multimodal communication , 1998, Proc. IEEE.
[96] Daniel McDuff,et al. Active Contrastive Learning of Audio-Visual Video Representations , 2021, ICLR.
[97] Daniel J. McDuff,et al. Contrastive Learning of Global and Local Video Representations , 2021, NeurIPS.
[98] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[99] Tomi Kinnunen,et al. A comparison of features for synthetic speech detection , 2015, INTERSPEECH.
[100] Daniel Povey,et al. The Kaldi Speech Recognition Toolkit , 2011 .
[101] Eric David Petajan,et al. Automatic Lipreading to Enhance Speech Recognition (Speech Reading) , 1984 .