Inferring the Why in Images

Abstract: Humans have the remarkable capability to infer the motivations behind other people's actions, likely due to cognitive skills known in psychology as theory of mind. In this paper, we strive to build a computational model that predicts the motivation behind the actions of people in images. To our knowledge, this challenging problem has not yet been extensively explored in computer vision. We present a novel learning-based framework that uses high-level visual recognition to infer why people are performing an action in images. However, the information in an image alone may not be sufficient to solve this task automatically. Since humans can draw on their own experiences to infer motivation, we propose to give computer vision systems access to some of these experiences by using recently developed natural language models to mine knowledge stored in massive amounts of text. While we are still far from automatically inferring motivation, our results suggest that transferring knowledge from language into vision can help machines understand why a person might be performing an action in an image.
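As a toy illustration of the idea of mining motivation knowledge from text, one could rank candidate motivations for a recognized action by how often they co-occur with that action in a corpus. Everything below (the corpus, the candidate motivations, and the scoring function) is a hypothetical sketch, not the paper's actual method, which uses large-scale language models:

```python
from collections import Counter

def score_motivations(action, candidates, corpus_sentences):
    """Rank candidate motivations by co-occurrence with the action in text.

    A sentence supports a candidate motivation when it mentions both the
    action word and at least one word of the candidate phrase.
    """
    counts = Counter()
    for sentence in corpus_sentences:
        words = set(sentence.lower().split())
        if action in words:
            for motive in candidates:
                if any(w in words for w in motive.lower().split()):
                    counts[motive] += 1
    # most_common() returns candidates sorted by descending support
    return counts.most_common()

# Hypothetical toy corpus standing in for large-scale text data
corpus = [
    "she is running because she wants exercise",
    "he runs to catch the bus",
    "people run for exercise every morning",
]
print(score_motivations("run", ["exercise", "catch bus"], corpus))
```

In a real system, the simple word-overlap count would be replaced by probabilities from a language model trained on web-scale text, but the ranking principle is the same.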
