EgoVQA - An Egocentric Video Question Answering Benchmark Dataset

Recently, much effort and attention have been devoted to Visual Question Answering (VQA) on static images and Video Question Answering (VideoQA) on third-person videos. First-person question answering, by contrast, has more natural use cases yet remains seldom studied. A typical scenario is an intelligent agent that assists people with disabilities: it perceives the environment in response to queries, localizes objects and persons from descriptions, and identifies the intentions of surrounding people to guide reactions (e.g., shaking hands or avoiding punches). However, due to the lack of first-person video datasets, little work has been carried out on the first-person VideoQA task. To address this issue, we collected a novel egocentric VideoQA dataset, EgoVQA, containing 600 question-answer pairs whose visual content spans 5,000 frames from 16 first-person videos. Various query types such as "Who", "What", and "How many" are included to form a semantically rich corpus. We use this dataset to evaluate four mainstream third-person VideoQA methods and to illustrate the gap in their performance between first-person and third-person related questions. We believe the EgoVQA dataset will facilitate future research on the important task of first-person VideoQA.
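To make the dataset description concrete, the sketch below shows one way a single EgoVQA question-answer pair and a simple accuracy metric could be represented. This is a minimal illustration only: the field names, the multiple-choice layout, and the example values are assumptions, not the dataset's actual release format.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record layout for one EgoVQA question-answer pair.
# Field names and the multiple-choice format are assumptions made
# for illustration; consult the released dataset for its real schema.
@dataclass
class EgoVQASample:
    video_id: str           # which of the 16 first-person videos the clip comes from
    clip_frames: List[int]  # frame indices covered by the clip
    question: str           # e.g. a "Who" / "What" / "How many" query
    candidates: List[str]   # candidate answers presented to the model
    answer_idx: int         # index of the correct candidate

def accuracy(samples: List[EgoVQASample], predictions: List[int]) -> float:
    """Fraction of questions whose predicted candidate matches the ground truth."""
    correct = sum(int(p == s.answer_idx) for s, p in zip(samples, predictions))
    return correct / len(samples) if samples else 0.0

if __name__ == "__main__":
    demo = [
        EgoVQASample(
            video_id="video_01",
            clip_frames=list(range(120, 240)),
            question="What am I holding in my hand?",
            candidates=["a cup", "a phone", "a book", "a pen", "a bottle"],
            answer_idx=1,
        ),
    ]
    print(accuracy(demo, [1]))  # 1.0
```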
