Audiovisual Masked Autoencoders
Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, Anurag Arnab