Adversarial Multimodal Network for Movie Story Question Answering

Visual question answering using information from multiple modalities has attracted increasing attention in recent years. It remains a challenging task, however, because visual content and natural language have quite different statistical properties. In this work, we present a method called Adversarial Multimodal Network (AMN) to better understand video stories for question answering. In AMN, we learn multimodal feature representations by finding a more coherent subspace for video clips and their corresponding texts (e.g., subtitles and questions) based on generative adversarial networks. Moreover, a self-attention mechanism is developed to enforce a newly introduced consistency constraint, which preserves the self-correlation among the visual cues of the original video clips in the learned multimodal representations. Extensive experiments on the benchmark MovieQA and TVQA datasets show the effectiveness of the proposed AMN over other published state-of-the-art methods.
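The core adversarial-alignment idea described above can be sketched as follows: each modality is projected into a shared subspace, and a modality discriminator is trained to tell video from text while the encoders are trained to fool it. This is a minimal illustrative sketch; all dimensions, weight initializations, and helper names are assumptions for exposition, not the paper's actual implementation.

```python
import math
import random

random.seed(0)

# Hypothetical feature sizes for illustration only (not from the paper).
D_VIDEO, D_TEXT, D_SHARED = 8, 6, 4

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def project(vec, weights):
    """Linear projection of a feature vector into the shared subspace."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weights)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Linear "encoders" for each modality, and a logistic modality
# discriminator over the shared subspace (label 1 = video, 0 = text).
W_video = rand_matrix(D_VIDEO, D_SHARED)
W_text = rand_matrix(D_TEXT, D_SHARED)
w_disc = [random.gauss(0.0, 0.1) for _ in range(D_SHARED)]

# Stand-ins for extracted features (e.g., a CNN clip feature and a
# subtitle embedding).
video_feat = [random.gauss(0.0, 1.0) for _ in range(D_VIDEO)]
text_feat = [random.gauss(0.0, 1.0) for _ in range(D_TEXT)]

z_v = project(video_feat, W_video)
z_t = project(text_feat, W_text)

p_v = sigmoid(sum(a * b for a, b in zip(z_v, w_disc)))  # P(modality = video)
p_t = sigmoid(sum(a * b for a, b in zip(z_t, w_disc)))

# The discriminator minimizes this loss to separate the two modalities ...
loss_disc = -(math.log(p_v) + math.log(1.0 - p_t))
# ... while the encoders minimize the adversarial loss, driving the shared
# representations toward modality-indistinguishability.
loss_enc = -(math.log(1.0 - p_v) + math.log(p_t))
```

In practice the encoders and discriminator would be deep networks updated in alternation, and AMN additionally applies a self-attention-based consistency constraint on the video side; the sketch only shows the adversarial objective that shapes the shared subspace.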
