Multimodal estimation and communication of latent semantic knowledge for robust execution of robot instructions

The goal of this article is to enable robots to robustly execute tasks specified by human instructions in partially observable environments. A robot’s ability to interpret and execute commands is fundamentally tied to its semantic world knowledge. Commonly, robots use exteroceptive sensors, such as cameras or LiDAR, to detect entities in the workspace and infer their visual properties and spatial relationships. However, many semantic world properties are visually imperceptible. We posit that non-exteroceptive modalities, including physical proprioception, factual descriptions, and domain knowledge, provide mechanisms for inferring such semantic properties of objects. We introduce a probabilistic model that fuses linguistic knowledge with visual and haptic observations into a cumulative belief over latent world attributes, allowing the robot to infer the meaning of instructions and execute the instructed tasks in a manner robust to erroneous, noisy, or contradictory evidence. In addition, we provide a method that allows the robot to communicate knowledge dissonance back to the human as a means of correcting errors in the operator’s world model. Finally, we propose an efficient framework that anticipates possible linguistic interactions and infers the associated groundings for the current world state, thereby bootstrapping both language understanding and generation. We present experiments on manipulators performing tasks that require inference over partially observed semantic properties, and evaluate our framework’s ability to exploit expressed information and knowledge bases to accelerate belief convergence, and to generate statements that correct declared facts observed to be inconsistent with the robot’s estimate of object properties.
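To make the fusion idea concrete, the sketch below shows a minimal recursive Bayesian update of a belief over a single latent semantic attribute (e.g., whether a container is full or empty) from haptic, visual, and linguistic evidence, together with a simple check for dissonance between a declared fact and the accumulated belief. This is an illustrative approximation rather than the authors' actual model; the attribute values, likelihood tables, and dissonance threshold are all hypothetical placeholders.

```python
import numpy as np

# Illustrative sketch only (not the paper's model): a recursive Bayesian
# update of a belief over one latent semantic attribute of an object,
# fused from multiple modalities. All likelihood tables are hypothetical.

ATTRIBUTE_VALUES = ["full", "empty"]

# p(observation | attribute) per modality; entries follow ATTRIBUTE_VALUES.
LIKELIHOODS = {
    # Haptic: lift effort discretized as "heavy" or "light".
    "haptic":   {"heavy": np.array([0.85, 0.10]),
                 "light": np.array([0.15, 0.90])},
    # Language: a declared fact from the operator, e.g. "the box is full".
    "language": {"full":  np.array([0.70, 0.30]),
                 "empty": np.array([0.30, 0.70])},
    # Vision: an appearance classifier that is nearly uninformative here.
    "vision":   {"full":  np.array([0.55, 0.45]),
                 "empty": np.array([0.45, 0.55])},
}


def update_belief(belief, modality, observation):
    """One Bayesian fusion step: posterior proportional to likelihood x prior."""
    posterior = LIKELIHOODS[modality][observation] * belief
    return posterior / posterior.sum()


def dissonance(belief, declared_value, threshold=0.8):
    """Flag a declared fact that contradicts the accumulated belief."""
    idx = ATTRIBUTE_VALUES.index(declared_value)
    return belief[idx] < (1.0 - threshold)


if __name__ == "__main__":
    belief = np.array([0.5, 0.5])                         # uniform prior
    belief = update_belief(belief, "language", "full")    # operator's statement
    belief = update_belief(belief, "vision", "full")      # weak visual cue
    belief = update_belief(belief, "haptic", "light")     # contact says light
    belief = update_belief(belief, "haptic", "light")
    print(dict(zip(ATTRIBUTE_VALUES, belief.round(3))))
    if dissonance(belief, "full"):
        print("Observed evidence contradicts the declared fact 'full'.")
```

In this toy run the repeated haptic evidence overwhelms the operator's statement, and the dissonance check is what would trigger a corrective utterance back to the human.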
