Tell me Dave: Context-sensitive grounding of natural language to manipulation instructions

It is important for a robot to be able to interpret natural language commands given by a human. In this paper, we consider performing a sequence of mobile manipulation tasks with instructions described in natural language. Even a simple task such as boiling water would be performed quite differently in a new environment, depending on the presence, location and state of the objects. We start by collecting a dataset of task descriptions in free-form natural language and the corresponding grounded task-logs of the tasks performed in an online robot simulator. We then build a library of verb–environment instructions that represents the possible instructions for each verb in that environment; these may or may not be valid for a different environment and task context. We present a model that takes into account the variations in natural language and the ambiguities in grounding them to robotic instructions with appropriate environment context and task constraints. Our model also handles incomplete or noisy natural language instructions. It is based on an energy function that encodes such properties in a form isomorphic to a conditional random field. We evaluate our model on tasks given in a robotic simulator and show that it outperforms the state of the art with 61.8% accuracy. We also demonstrate a grounded robotic instruction sequence on a PR2 robot using the Learning from Demonstration approach.
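
The idea of an energy function over a chain of grounded instructions can be illustrated with a minimal sketch. This is not the model described in this paper: it assumes a simple chain in which each language clause has a small set of candidate instructions, a unary energy scores how well a candidate fits the environment, and a pairwise energy scores how well consecutive candidates fit together. The function `min_energy_sequence`, the candidate names and all energy values are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def min_energy_sequence(
    candidates: Sequence[Sequence[str]],
    unary: Callable[[int, str], float],
    pairwise: Callable[[str, str], float],
) -> Tuple[List[str], float]:
    """Pick one candidate per clause so that the summed chain energy is minimal.

    Dynamic programming (Viterbi-style) over a chain; every clause is assumed
    to have at least one candidate.
    """
    if not candidates:
        return [], 0.0
    # best[i][j]: lowest energy of any prefix that ends in candidates[i][j]
    best = [[unary(0, c) for c in candidates[0]]]
    # back[i][j]: index of the predecessor candidate on that lowest-energy prefix
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(candidates)):
        row, ptr = [], []
        for c in candidates[i]:
            costs = [best[i - 1][k] + pairwise(p, c)
                     for k, p in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_min] + unary(i, c))
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)
    # Trace the lowest-energy path back from the final clause.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path: List[str] = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path, total


# Hypothetical toy task: "fill the pot, then heat it".
CANDIDATES = [
    ["fill_from_tap", "fill_from_bottle"],
    ["heat_on_stove", "heat_in_microwave"],
]
# Made-up unary energies (lower means a better fit to the environment).
UNARY = {
    (0, "fill_from_tap"): 0.5,
    (0, "fill_from_bottle"): 1.0,
    (1, "heat_on_stove"): 0.4,
    (1, "heat_in_microwave"): 0.3,
}
# Made-up pairwise penalty for an incompatible pair of consecutive steps.
PAIRWISE = {("fill_from_tap", "heat_in_microwave"): 2.0}


def unary(i: int, c: str) -> float:
    return UNARY[(i, c)]


def pairwise(prev: str, cur: str) -> float:
    return PAIRWISE.get((prev, cur), 0.0)


if __name__ == "__main__":
    print(min_energy_sequence(CANDIDATES, unary, pairwise))
```

In this toy setting the microwave has the lower unary energy, but the pairwise penalty makes the tap-then-stove sequence the overall minimum, which is the kind of context dependence an energy-based formulation is meant to capture.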
