It Takes Two to Tango: Towards Theory of AI's Mind

Theory of Mind is the ability to attribute mental states (beliefs, intents, knowledge, perspectives, etc.) to others and to recognize that these mental states may differ from one's own. Theory of Mind is critical to effective communication and to teams that demonstrate high collective performance. To effectively leverage progress in Artificial Intelligence (AI) and make our lives more productive, humans and AI must work well together as a team. Traditionally, research has emphasized making AI more accurate and, to a lesser extent, having it better understand human intentions, tendencies, beliefs, and contexts. The latter involves making AI more human-like: having it develop a theory of our minds. In this work, we argue that for human-AI teams to be effective, humans must also develop a theory of AI's mind (ToAIM): get to know its strengths, weaknesses, beliefs, and quirks. We instantiate these ideas within the domain of Visual Question Answering (VQA). We find that with just a small number of examples (50), lay people can be trained to better predict the responses and oncoming failures of a complex VQA model. We further evaluate the role that existing explanation (or interpretability) modalities play in helping humans build a ToAIM. Explainable AI has received considerable scientific and popular attention in recent times. Surprisingly, we find that access to the model's internal states, that is, its confidence in its top-k predictions and explicit or implicit attention maps that highlight the regions in the image (and words in the question) the model is looking at (and listening to) while answering a question about an image, does not help people better predict its behavior.
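To make the two explanation modalities concrete, below is a minimal sketch of how a VQA model's internal states might be packaged for a human observer: its softmax confidence over its top-k answer candidates, and the cells of a spatial attention grid it weighted most heavily. Everything here is a hypothetical stand-in (the `toy_vqa_forward` function, the tiny answer vocabulary, and the random "weights" are ours for illustration, not the model studied in the paper); it shows only the shape of the interface, not the actual VQA architecture.

```python
import numpy as np

# Hypothetical stand-in for a VQA model's forward pass: returns answer
# logits over a small answer vocabulary and a coarse spatial attention
# grid. A real model would compute these from the image and question;
# here we sample them randomly purely for illustration.
ANSWERS = ["yes", "no", "2", "red", "frisbee"]

def toy_vqa_forward(image, question, rng):
    logits = rng.normal(size=len(ANSWERS))   # scores over answer vocabulary
    attention = rng.random(size=(7, 7))      # 7x7 spatial attention grid
    attention /= attention.sum()             # normalize to a distribution
    return logits, attention

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def explain(logits, attention, k=3):
    """Package the internal states the way they could be shown to a
    human predictor: top-k answers with confidences, plus the image
    regions the model attended to most strongly."""
    probs = softmax(logits)
    topk = np.argsort(probs)[::-1][:k]
    confidences = [(ANSWERS[i], float(probs[i])) for i in topk]
    # Flag grid cells attended to well above the average level.
    hot = np.argwhere(attention > attention.mean() + attention.std())
    return confidences, hot

rng = np.random.default_rng(0)
logits, attn = toy_vqa_forward(image=None,
                               question="What color is the frisbee?",
                               rng=rng)
conf, hot_cells = explain(logits, attn)
print("top-k answer confidences:", conf)
print("highly attended grid cells (row, col):", hot_cells.tolist())
```

In the study's terms, a human predictor would see the output of `explain` alongside the image and question, then guess whether the model is about to answer correctly; the paper's finding is that this extra signal did not improve those guesses.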
