Guiding Interaction Behaviors for Multi-modal Grounded Language Learning

Multi-modal grounded language learning connects language predicates to physical properties of objects in the world. Sensing with multiple modalities, such as audio, haptics, and visual colors and shapes, while performing interaction behaviors like lifting, dropping, and looking on objects enables a robot to ground non-visual predicates like “empty” as well as visual predicates like “red”. Previous work has established that grounding in multi-modal space improves performance on object retrieval from human descriptions. In this work, we gather behavior annotations from humans and demonstrate that these annotations improve language grounding performance by allowing a system to focus on the behaviors relevant to each word, e.g., “white” can be understood by looking and “half-full” by lifting. We also explore adding modality annotations (whether to focus on audio or haptics when performing a behavior), which further improves performance, and sharing information between linguistically related predicates (if “green” is a color, so is “white”), which improves grounding recall at the cost of precision.
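
To make the idea concrete, the sketch below shows one way per-predicate behavior and modality annotations could gate which (behavior, modality) context classifiers contribute to a grounding decision. This is an illustrative assumption, not the paper's implementation: the annotation dictionaries, the `context_scores` stand-in for per-context classifier confidences, and the simple score averaging are all hypothetical.

```python
# Illustrative sketch (not the paper's code) of annotation-gated grounding.
# Per-context classifier confidences are assumed to lie in [-1, 1].

BEHAVIORS = {"look", "lift", "drop"}

# Human annotations: behaviors (and optionally modalities) judged relevant
# for each language predicate. Contents here are hypothetical examples.
behavior_annotations = {
    "white": {"look"},        # color word: looking is enough
    "half-full": {"lift"},    # weight-related: lifting is informative
}
modality_annotations = {
    "half-full": {"haptics"}, # when lifting, attend to haptics rather than audio
}

def grounded_decision(predicate, context_scores):
    """Average the scores of contexts whose behavior and modality are
    annotated as relevant for `predicate`; fall back to all contexts
    when no annotation exists for that predicate."""
    behaviors = behavior_annotations.get(predicate, BEHAVIORS)
    modalities = modality_annotations.get(predicate)  # None = no restriction
    kept = [score for (behavior, modality), score in context_scores.items()
            if behavior in behaviors
            and (modalities is None or modality in modalities)]
    return sum(kept) / len(kept) if kept else 0.0

# Fake per-context confidences for one candidate object.
scores = {
    ("look", "vision"): 0.9,
    ("lift", "haptics"): -0.3,
    ("lift", "audio"): 0.4,
    ("drop", "audio"): 0.1,
}
print(grounded_decision("white", scores))      # 0.9  (only the look/vision context)
print(grounded_decision("half-full", scores))  # -0.3 (only the lift/haptics context)
```

In this toy setup, ignoring contexts outside the annotated behaviors keeps irrelevant signals (e.g., audio while dropping) from diluting the decision for a visual word like “white”, which mirrors the intuition behind using human behavior annotations to focus the grounding model.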
