Situated Grounding Facilitates Multimodal Concept Learning for AI

Peer-to-peer human-computer interactions require a minimum level of capability that remains beyond current unimodal approaches. Computers must recognize and generate communicative acts within multiple modalities, understand the grounding of communicative acts within the shared context and situation of both interlocutors, and appreciate the consequences of behavior and actions within the interaction. In this short paper, we discuss an approach to interactive concept learning using multimodal simulations that situate and contextualize the interaction, thereby visually demonstrating what the computer believes and understands. We examine an example of situated grounding in a collaborative task, and its uses in probing learned models and interactive learning.
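To make the notion of situated grounding concrete, below is a minimal illustrative sketch in Python. It is not drawn from the paper's own system; it only shows the general idea of grounding an under-specified referring expression (e.g., "the block near the plate") against a shared situation and then surfacing the interpretation for the human partner to confirm or correct. All names, data structures, and thresholds here are hypothetical.

```python
# Minimal sketch (not the authors' implementation): ground a referring
# expression against a shared situation, then expose the agent's belief
# so the human interlocutor can confirm or correct it.
from __future__ import annotations
from dataclasses import dataclass
from math import dist


@dataclass
class Entity:
    name: str
    color: str
    position: tuple  # (x, y) on a shared tabletop


# Hypothetical situation shared by the human and the agent.
situation = [
    Entity("block1", "red", (0.10, 0.20)),
    Entity("block2", "green", (0.80, 0.70)),
    Entity("plate1", "white", (0.75, 0.65)),
]


def ground(description: dict, situation: list[Entity]) -> list[Entity]:
    """Return every entity consistent with an under-specified description."""
    candidates = [e for e in situation
                  if description.get("color") in (None, e.color)]
    anchor = description.get("near")
    if anchor is not None:
        ref = next(e for e in situation if e.name == anchor)
        # Arbitrary proximity threshold for the "near" relation.
        candidates = [e for e in candidates
                      if e is not ref and dist(e.position, ref.position) < 0.3]
    return candidates


# "the block near the plate" -> the agent grounds it and states its belief.
hypotheses = ground({"color": None, "near": "plate1"}, situation)
for h in hypotheses:
    print(f"I think you mean {h.name} ({h.color}) at {h.position}. Is that right?")
```

In the setting the paper describes, this demonstration step is visual rather than textual: the multimodal simulation enacts the agent's interpretation so the human can see what the computer believes and understands.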
