Movie Question Answering via Textual Memory and Plot Graph

Movies provide us with a mass of visual content as well as attracting stories. Existing methods have illustrated that understanding movie stories through only visual content is still a hard problem. In this paper, for answering questions about movies, we introduce a new dataset called PlotGraphs, as external knowledge. The dataset contains massive graph-based information of movies. In addition, we put forward a model that can utilize movie clip, subtitle, and graph-based external knowledge. The model contains two main parts: a layered memory network (LMN) and a plot graph representation network (PGRN). In particular, the LMN can represent frame-level and clip-level movie content by the fixed word memory module and the adaptive subtitle memory module, respectively. And the plot graph representation network can represent the entire graph. We first extract words and sentences from the training movie subtitles and then the hierarchically formed movie representations, which are learned from LMN. At the same time, the PGRN can represent the semantic information and the relationships in the graph. We conduct extensive experiments on the MovieQA dataset and the PlotGraphs dataset. With only visual content as inputs, the LMN with frame-level representation obtains a large performance improvement. When incorporating subtitles into LMN to form the clip-level representation, we achieve the state-of-the-art performance on the online evaluation task of “Video+Subtitles.” After the integration of external knowledge, the performance of the model consisting of the LMN and the PGRN is further improved. The good performance successfully demonstrates that the external knowledge and the proposed model are effective for movie understanding.

[1]  Juan Carlos Niebles,et al.  Leveraging Video Descriptions to Learn Video Question Answering , 2016, AAAI.

[2]  Qi Tian,et al.  Sequential Video VLAD: Training the Aggregation Locally and Temporally , 2018, IEEE Transactions on Image Processing.

[3]  Jason Weston,et al.  Memory Networks , 2014, ICLR.

[4]  Yi Yang,et al.  Uncovering the Temporal Context for Video Question Answering , 2017, International Journal of Computer Vision.

[5]  Qi Tian,et al.  Pooling the Convolutional Layers in Deep ConvNets for Video Action Recognition , 2015, IEEE Transactions on Circuits and Systems for Video Technology.

[6]  Zheng Wang,et al.  Catching the Temporal Regions-of-Interest for Video Captioning , 2017, ACM Multimedia.

[7]  Rada Mihalcea,et al.  Speaker Naming in Movies , 2018, NAACL.

[8]  Jung-Woo Ha,et al.  Dual Attention Networks for Multimodal Reasoning and Matching , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Alexander J. Smola,et al.  Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Yue Gao,et al.  Beyond Text QA: Multimedia Answer Generation by Harvesting Web Information , 2013, IEEE Transactions on Multimedia.

[11]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[12]  Byoung-Tak Zhang,et al.  DeepStory: Video Story QA by Deep Embedded Memory Networks , 2017, IJCAI.

[13]  Sanja Fidler,et al.  Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[14]  Richard Socher,et al.  Dynamic Memory Networks for Visual and Textual Question Answering , 2016, ICML.

[15]  Michael S. Bernstein,et al.  Visual7W: Grounded Question Answering in Images , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Xiaoyu Du,et al.  Adversarial Personalized Ranking for Recommendation , 2018, SIGIR.

[17]  Xinlei Chen,et al.  Iterative Visual Reasoning Beyond Convolutions , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[18]  Byoung-Tak Zhang,et al.  Multimodal Dual Attention Memory for Video Story Question Answering , 2018, ECCV.

[19]  Yi Yang,et al.  Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Wenwu Zhu,et al.  Incorporating External Knowledge to Answer Open-Domain Visual Questions with Dynamic Memory Networks , 2017, ArXiv.

[22]  Meng Wang,et al.  Event Driven Web Video Summarization by Tag Localization and Key-Shot Identification , 2012, IEEE Transactions on Multimedia.

[23]  Zhen Wang,et al.  Knowledge Graph Embedding by Translating on Hyperplanes , 2014, AAAI.

[24]  Chitta Baral,et al.  Explicit Reasoning over End-to-End Neural Architectures for Visual Question Answering , 2018, AAAI.

[25]  Sanja Fidler,et al.  MovieQA: Understanding Stories in Movies through Question-Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Kate Saenko,et al.  Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering , 2015, ECCV.

[27]  Gerhard Weikum,et al.  YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia: Extended Abstract , 2013, IJCAI.

[28]  Yi Yang,et al.  Semisupervised Feature Selection via Spline Regression for Video Semantic Recognition , 2015, IEEE Transactions on Neural Networks and Learning Systems.

[29]  Yale Song,et al.  TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Junwei Han,et al.  CNNs-Based RGB-D Saliency Detection via Cross-View Transfer and Multiview Fusion. , 2018, IEEE transactions on cybernetics.

[31]  Gunhee Kim,et al.  A Read-Write Memory Network for Movie Story Understanding , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[32]  Liangliang Cao,et al.  Focal Visual-Text Attention for Visual Question Answering , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Junwei Han,et al.  A Unified Metric Learning-Based Framework for Co-Saliency Detection , 2018, IEEE Transactions on Circuits and Systems for Video Technology.

[34]  Qi Wu,et al.  FVQA: Fact-Based Visual Question Answering , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Bernt Schiele,et al.  A dataset for Movie Description , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Richard S. Zemel,et al.  Image Question Answering: A Visual Semantic Embedding Model and a New Dataset , 2015, ArXiv.

[37]  Saurabh Singh,et al.  Where to Look: Focus Regions for Visual Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Tao Mei,et al.  Multi-level Attention Networks for Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Chunhua Shen,et al.  Explicit Knowledge-based Reasoning for Visual Question Answering , 2015, IJCAI.

[40]  Jens Lehmann,et al.  DBpedia: A Nucleus for a Web of Open Data , 2007, ISWC/ASWC.

[41]  Lin Ma,et al.  Learning to Answer Questions from Image Using Convolutional Neural Network , 2015, AAAI.

[42]  Huanbo Luan,et al.  Modeling Relation Paths for Representation Learning of Knowledge Bases , 2015, EMNLP.

[43]  Li Fei-Fei,et al.  Knowledge Acquisition for Visual Question Answering via Iterative Querying , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Zhongchao Shi,et al.  Video Captioning with Listwise Supervision , 2017, AAAI.

[45]  Ming Gao,et al.  BiRank: Towards Ranking on Bipartite Graphs , 2017, IEEE Transactions on Knowledge and Data Engineering.

[46]  Praveen Paritosh,et al.  Freebase: a collaboratively created graph database for structuring human knowledge , 2008, SIGMOD Conference.

[47]  Mario Fritz,et al.  Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[48]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[49]  Bohyung Han,et al.  MarioQA: Answering Questions by Watching Gameplay Videos , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[50]  Jason Weston,et al.  End-To-End Memory Networks , 2015, NIPS.

[51]  Peng Wang,et al.  Ask Me Anything: Free-Form Visual Question Answering Based on Knowledge from External Sources , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).