Revisiting the Evaluation of Theory of Mind through Question Answering

Theory of mind, i.e., the ability to reason about the intents and beliefs of other agents, is an important capability for artificial intelligence and central to resolving ambiguous references in natural language dialogue. In this work, we revisit the evaluation of theory of mind through question answering. We show that current evaluation methods are flawed and that existing benchmark tasks can be solved without theory of mind due to dataset biases. Building on prior work, we propose an improved evaluation protocol and dataset that explicitly control for data regularities via a careful examination of the answer space. We show that state-of-the-art methods that succeed on existing benchmarks fail to solve the theory-of-mind tasks under our proposed evaluation.
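The claim that existing benchmarks can be solved without theory of mind can be made concrete with a simple baseline. The sketch below uses made-up toy data (question types and answers are illustrative, not drawn from any specific benchmark): it predicts the per-question-type majority answer, and scores well whenever the answer distribution is skewed, despite performing no belief reasoning at all.

```python
from collections import Counter

# Illustrative toy data in the style of Sally-Anne false-belief stories:
# each item is (question_type, correct_answer). The answer distribution
# is deliberately skewed per question type, as a biased benchmark's might be.
dataset = [
    ("first_order_belief", "basket"),
    ("first_order_belief", "basket"),
    ("first_order_belief", "basket"),
    ("first_order_belief", "box"),
    ("reality", "box"),
    ("reality", "box"),
    ("memory", "basket"),
    ("memory", "basket"),
]

def majority_baseline(train):
    """Map each question type to its most frequent answer in `train`."""
    counts = {}
    for qtype, answer in train:
        counts.setdefault(qtype, Counter())[answer] += 1
    return {qtype: c.most_common(1)[0][0] for qtype, c in counts.items()}

predictor = majority_baseline(dataset)
correct = sum(predictor[qtype] == answer for qtype, answer in dataset)
accuracy = correct / len(dataset)
print(accuracy)  # 0.875 on this toy data, with no theory-of-mind reasoning
```

An evaluation that controls for such regularities would balance the answer space so that no question type has a dominant answer, driving a baseline like this down to chance.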
