McOmet: Multimodal Fusion Transformer for Physical Audiovisual Commonsense Reasoning
[1] Paul Pu Liang, et al. PACS: A Dataset for Physical Audiovisual CommonSense Reasoning, 2022, ECCV.
[2] Yejin Choi, et al. MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound, 2022, CVPR.
[3] Weizhu Chen, et al. DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing, 2021, ICLR.
[4] Yang Wang, et al. Joint Visual and Audio Learning for Video Highlight Detection, 2021, ICCV.
[5] J. Chai, et al. Tiered Reasoning for Intuitive Physics: Toward Verifiable Commonsense Language Understanding, 2021, EMNLP.
[6] Depu Meng, et al. Conditional DETR for Fast Training Convergence, 2021, ICCV.
[7] C. Schmid, et al. Attention Bottlenecks for Multimodal Fusion, 2021, NeurIPS.
[8] Federico Raue, et al. AudioCLIP: Extending CLIP to Image, Text and Audio, 2022, ICASSP.
[9] Ali Farhadi, et al. MERLOT: Multimodal Neural Script Knowledge Models, 2021, NeurIPS.
[10] Shih-Fu Chang, et al. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, 2021, NeurIPS.
[11] Humphrey Shi, et al. Escaping the Big Data Paradigm with Compact Transformers, 2021, arXiv.
[12] James R. Glass, et al. AST: Audio Spectrogram Transformer, 2021, Interspeech.
[13] Andrew Zisserman, et al. Perceiver: General Perception with Iterative Attention, 2021, ICML.
[14] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[15] Limin Wang, et al. TDN: Temporal Difference Networks for Efficient Action Recognition, 2021, CVPR.
[16] Zhenjie Zhao, et al. Learning Physical Common Sense as Knowledge Graph Completion via BERT Data Augmentation and Constrained Tucker Factorization, 2020, EMNLP.
[17] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.
[18] Yagya Raj Pandeya, et al. Deep learning-based late fusion of multimodal information for emotion classification of music video, 2020, Multimedia Tools and Applications.
[19] Louis-Philippe Morency, et al. What Gives the Answer Away? Question Answering Bias Analysis on Video QA Datasets, 2020, arXiv.
[20] Jianfeng Gao, et al. DeBERTa: Decoding-enhanced BERT with Disentangled Attention, 2020, ICLR.
[21] Nicolas Usunier, et al. End-to-End Object Detection with Transformers, 2020, ECCV.
[22] Yejin Choi, et al. PIQA: Reasoning about Physical Commonsense in Natural Language, 2019, AAAI.
[23] Yu Cheng, et al. UNITER: UNiversal Image-TExt Representation Learning, 2019, ECCV.
[24] Yejin Choi, et al. Do Neural Language Representations Learn Physical Commonsense?, 2019, CogSci.
[25] Louis-Philippe Morency, et al. Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence, 2019, CVPR.
[26] Chang Zhou, et al. Cognitive Graph for Multi-Hop Reading Comprehension at Scale, 2019, ACL.
[27] Ali Farhadi, et al. From Recognition to Cognition: Visual Commonsense Reasoning, 2019, CVPR.
[28] Yoav Artzi, et al. A Corpus for Reasoning about Natural Language Grounded in Photographs, 2018, ACL.
[29] Licheng Yu, et al. TVQA: Localized, Compositional Video Question Answering, 2018, EMNLP.
[30] David M. Mimno, et al. Quantifying the Visual Concreteness of Words and Topics in Multimodal Datasets, 2018, NAACL.
[31] Luke S. Zettlemoyer, et al. Deep Contextualized Word Representations, 2018, NAACL.
[32] Yejin Choi, et al. Verb Physics: Relative Physical Knowledge of Actions and Objects, 2017, ACL.
[33] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.
[34] Louis-Philippe Morency, et al. Multimodal Machine Learning: A Survey and Taxonomy, 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[35] Aren Jansen, et al. Audio Set: An ontology and human-labeled dataset for audio events, 2017, ICASSP.
[36] Ali Farhadi, et al. Commonly Uncommon: Semantic Sparsity in Situation Recognition, 2017, CVPR.
[37] Michael S. Bernstein, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, 2016, International Journal of Computer Vision.
[38] Matthew R. Walter, et al. Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences, 2015, AAAI.
[39] Stéphane Ayache, et al. Majority Vote of Diverse Classifiers for Late Fusion, 2014, S+SSPR.
[40] Fei-Fei Li, et al. ImageNet: A large-scale hierarchical image database, 2009, CVPR.
[41] Christopher M. Bishop. Pattern Recognition and Machine Learning, 2006, Springer.
[42] Alec Radford, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[43] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.