Toward a Human-Level Video Understanding Intelligence

We aim to develop an AI agent that can watch video clips and hold a conversation with humans about the video story. Developing video understanding intelligence is a highly challenging task, and evaluation methods that adequately measure and analyze the progress of AI agents are also lacking. In this paper, we propose the Video Turing Test to provide effective and practical assessments of video understanding intelligence as well as a human-likeness evaluation of AI agents. We define a general format and procedure for the Video Turing Test and present a case study to confirm the effectiveness and usefulness of the proposed test.
