MarioQA: Answering Questions by Watching Gameplay Videos

We present a framework to analyze various aspects of models for video question answering (VideoQA) using customizable synthetic datasets, which are constructed automatically from gameplay videos. Our work is motivated by the fact that existing models are often tested only on datasets that require excessively high-level reasoning or mostly contain instances accessible through single frame inferences. Hence, it is difficult to measure capacity and flexibility of trained models, and existing techniques often rely on adhoc implementations of deep neural networks without clear insight into datasets and models. We are particularly interested in understanding temporal relationships between video events to solve VideoQA problems; this is because reasoning temporal dependency is one of the most distinct components in videos from images. To address this objective, we automatically generate a customized synthetic VideoQA dataset using Super Mario Bros. gameplay videos so that it contains events with different levels of reasoning complexity. Using the dataset, we show that properly constructed datasets with events in various complexity levels are critical to learn effective models and improve overall performance.

[1]  Saurabh Singh,et al.  Where to Look: Focus Regions for Visual Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[3]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[4]  Sanja Fidler,et al.  MovieQA: Understanding Stories in Movies through Question-Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Bernt Schiele,et al.  Grounding Action Descriptions in Videos , 2013, TACL.

[6]  Gang Wang,et al.  Multi-Task CNN Model for Attribute Prediction , 2015, IEEE Transactions on Multimedia.

[7]  Alexander J. Smola,et al.  Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Mario Fritz,et al.  A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input , 2014, NIPS.

[10]  Richard Socher,et al.  Dynamic Memory Networks for Visual and Textual Question Answering , 2016, ICML.

[11]  Bohyung Han,et al.  Image Question Answering Using Convolutional Neural Network with Dynamic Parameter Prediction , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Trevor Darrell,et al.  Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding , 2016, EMNLP.

[13]  Peng Wang,et al.  Ask Me Anything: Free-Form Visual Question Answering Based on Knowledge from External Sources , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Dan Klein,et al.  Learning to Compose Neural Networks for Question Answering , 2016, NAACL.

[15]  Jason Weston,et al.  Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks , 2015, ICLR.

[16]  Ming Yang,et al.  DeepFace: Closing the Gap to Human-Level Performance in Face Verification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[18]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Li Fei-Fei,et al.  CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Bernt Schiele,et al.  A dataset for Movie Description , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Kate Saenko,et al.  Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering , 2015, ECCV.

[22]  Sanja Fidler,et al.  Skip-Thought Vectors , 2015, NIPS.

[23]  Wei Xu,et al.  ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering , 2015, ArXiv.

[24]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[25]  Yi Yang,et al.  Uncovering Temporal Context for Video Question and Answering , 2015, ArXiv.

[26]  Richard S. Zemel,et al.  Exploring Models and Data for Image Question Answering , 2015, NIPS.

[27]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Yang Zhong,et al.  Face attribute prediction using off-the-shelf CNN features , 2016, 2016 International Conference on Biometrics (ICB).

[29]  Christopher Joseph Pal,et al.  Movie Description , 2016, International Journal of Computer Vision.

[30]  Bolei Zhou,et al.  Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.

[31]  Yuandong Tian,et al.  Simple Baseline for Visual Question Answering , 2015, ArXiv.

[32]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.