CogME: A Novel Evaluation Metric for Video Understanding Intelligence

Developing video understanding intelligence is quite challenging because it requires the holistic integration of images, scripts, and sounds based on natural language processing, temporal dependency, and reasoning. Recently, substantial effort has been devoted to building large-scale video datasets with associated question answering (QA) tasks. However, existing evaluation metrics for video question answering (VideoQA) do not provide a meaningful analysis of what a model actually understands. To make progress, we argue that a well-made framework, grounded in the way humans understand stories, is required to explain and evaluate understanding performance in detail. We therefore propose a top-down evaluation system for VideoQA, based on the human cognitive process and story elements: Cognitive Modules for Evaluation (CogME). CogME is composed of three cognitive modules: targets, contents, and thinking. The interaction among the modules during the understanding procedure can be expressed in one sentence: "I understand the CONTENT of the TARGET through a way of THINKING." Each module has sub-components derived from the story elements. We can specify the required aspects of understanding by annotating the sub-components to individual questions. CogME thus provides a framework for an elaborated specification of VideoQA datasets. To examine the suitability of a VideoQA dataset for validating video understanding intelligence, we evaluated the baseline model of the DramaQA dataset by applying CogME. The evaluation reveals that story elements are unevenly reflected in the existing dataset, and that a model trained on the dataset may make biased predictions. Although this study covers only a narrow range of stories, we expect it to offer a first step toward evaluating the video understanding intelligence of both humans and AI through the lens of human cognitive processes.
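To illustrate how per-question annotation along the three CogME modules could be used to audit a dataset, the following is a minimal sketch. The module names (targets, contents, thinking) come from the abstract; the specific annotation values and the `module_coverage` helper are hypothetical, not part of CogME itself.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CogMEAnnotation:
    """CogME labels for one VideoQA question, following the scheme
    'I understand the CONTENT of the TARGET through a way of THINKING.'"""
    target: str    # what the question asks about (illustrative values)
    content: str   # which story element is probed (illustrative values)
    thinking: str  # the kind of reasoning required (illustrative values)


def module_coverage(annotations):
    """Tally how often each sub-component value appears per module.

    A skewed tally would indicate that story elements are unevenly
    reflected in the dataset, as the CogME analysis of DramaQA found.
    """
    counts = {"target": Counter(), "content": Counter(), "thinking": Counter()}
    for a in annotations:
        counts["target"][a.target] += 1
        counts["content"][a.content] += 1
        counts["thinking"][a.thinking] += 1
    return counts
```

For example, annotating a handful of questions and calling `module_coverage` would immediately expose whether one reasoning type dominates the dataset.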
