A comparative study of language transformers for video question answering
Chenhui Chu | Mayu Otani | Yuta Nakashima | Haruo Takemura | Noa Garcia | Zekun Yang