Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents

Movies provide us with a mass of visual content as well as attractive stories. Existing methods have shown that understanding movie stories from visual content alone remains a hard problem. In this paper, to answer questions about movies, we put forward a Layered Memory Network (LMN) that represents frame-level and clip-level movie content with a Static Word Memory module and a Dynamic Subtitle Memory module, respectively. Specifically, we first extract words and sentences from the training movie subtitles. The hierarchically formed movie representations learned by the LMN then encode not only the correspondence between words and visual content inside frames, but also the temporal alignment between sentences and frames inside movie clips. We also extend the LMN into three variant frameworks to demonstrate its extensibility. We conduct extensive experiments on the MovieQA dataset. With only visual content as input, the LMN with frame-level representations obtains a large performance improvement. When subtitles are incorporated to form the clip-level representation, the LMN achieves state-of-the-art performance on the online 'Video+Subtitles' evaluation task. These results demonstrate that the proposed LMN framework is effective and that the hierarchically formed movie representations hold good potential for movie question answering.
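The abstract names the two memory modules but does not give their equations, so the following is only a minimal PyTorch sketch of how such a layered memory could be wired: frame features first read from a word memory built over subtitle words, and the resulting word-grounded frame representations are then aligned with subtitle sentences inside a clip. The class names echo the paper's module names, but all internals here (linear projections, dot-product attention, residual reads, mean pooling, and the dimensions) are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StaticWordMemory(nn.Module):
    """Frame-level sketch: each frame feature attends over a word memory
    built from subtitle word embeddings (hypothetical formulation)."""
    def __init__(self, vis_dim, word_dim, mem_dim):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, mem_dim)
        self.word_proj = nn.Linear(word_dim, mem_dim)

    def forward(self, frame_feats, word_embs):
        # frame_feats: (num_frames, vis_dim); word_embs: (num_words, word_dim)
        q = self.vis_proj(frame_feats)           # (num_frames, mem_dim)
        m = self.word_proj(word_embs)            # (num_words, mem_dim)
        attn = F.softmax(q @ m.t(), dim=-1)      # word attention per frame
        read = attn @ m                          # memory read per frame
        return q + read                          # word-grounded frame repr.

class DynamicSubtitleMemory(nn.Module):
    """Clip-level sketch: frame representations are temporally aligned
    with subtitle sentence embeddings inside one movie clip."""
    def __init__(self, frame_dim, sent_dim, mem_dim):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, mem_dim)
        self.sent_proj = nn.Linear(sent_dim, mem_dim)

    def forward(self, frame_reprs, sent_embs):
        # frame_reprs: (num_frames, frame_dim); sent_embs: (num_sents, sent_dim)
        f = self.frame_proj(frame_reprs)
        s = self.sent_proj(sent_embs)
        attn = F.softmax(f @ s.t(), dim=-1)      # sentence attention per frame
        return (f + attn @ s).mean(dim=0)        # pooled clip-level repr.
```

A toy usage under the same assumptions: the feature sources (a CNN for frames, word2vec-style embeddings for subtitles) match the kinds of inputs the abstract implies, while the sizes below are arbitrary.

```python
frames = torch.randn(16, 512)   # e.g., CNN features of 16 frames
words = torch.randn(40, 300)    # word embeddings of subtitle words
sents = torch.randn(5, 300)     # sentence embeddings of subtitles

frame_reprs = StaticWordMemory(512, 300, 256)(frames, words)   # frame level
clip_repr = DynamicSubtitleMemory(256, 300, 256)(frame_reprs, sents)  # clip level
```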
