Placing Objects in Gesture Space: Toward Incremental Interpretation of Multimodal Spatial Descriptions

When describing routes not in the current environment, a common strategy is to anchor the description in configurations of salient landmarks, complementing the verbal description by “placing” the non-visible landmarks in the gesture space. Understanding such multimodal descriptions and later locating the landmarks in the real world is a challenging task for the hearer, who must interpret speech and gestures in parallel, fuse information from both modalities, build a mental representation of the description, and ground this knowledge in real-world landmarks. In this paper, we model the hearer’s task, using a multimodal spatial description corpus we collected. To reduce the variability of the verbal descriptions, we simplified the setup to use simple objects as landmarks. We describe a real-time system with which we evaluate the separate and joint contributions of the modalities. We show that gestures not only improve overall system performance, even though they largely encode redundant information, but also lead to earlier final correct interpretations. Being able to build and apply representations incrementally will be of use in more dialogical settings, we argue, where it can enable immediate clarification in cases of mismatch.
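To make the described hearer's task concrete, the sketch below illustrates one simple way such incremental multimodal interpretation could look: a distribution over candidate objects is updated increment by increment, fusing word-level speech evidence with gesture evidence (proximity of the hand to where an object was earlier "placed" in gesture space). This is a minimal illustration under our own assumptions, not the system described in the paper; the object names, the uniform prior, the toy lexicon, and the late-fusion rule are all hypothetical.

```python
# Hedged sketch (not the paper's implementation): incremental late fusion of
# speech- and gesture-based evidence about which object is currently meant.
# Candidate names, prior, lexicon, and fusion rule are illustrative assumptions.

from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class IncrementalInterpreter:
    """Maintains a distribution over candidate objects, updated per increment."""
    candidates: Dict[str, float] = field(
        default_factory=lambda: {"cup": 1 / 3, "book": 1 / 3, "lamp": 1 / 3}
    )

    def _normalize(self) -> None:
        total = sum(self.candidates.values()) or 1.0
        for name in self.candidates:
            self.candidates[name] /= total

    def update_speech(self, word: str, lexicon: Dict[str, Dict[str, float]]) -> None:
        # Multiply in a word-conditioned likelihood, if the word is informative.
        likelihoods = lexicon.get(word)
        if likelihoods:
            for name in self.candidates:
                self.candidates[name] *= likelihoods.get(name, 0.1)
            self._normalize()

    def update_gesture(self, hand_xy: Tuple[float, float],
                       placements: Dict[str, Tuple[float, float]]) -> None:
        # Score each candidate by proximity of the current hand position to
        # where that object was previously "placed" in gesture space.
        for name, (px, py) in placements.items():
            dist = ((hand_xy[0] - px) ** 2 + (hand_xy[1] - py) ** 2) ** 0.5
            self.candidates[name] *= 1.0 / (1.0 + dist)
        self._normalize()

    def best(self) -> str:
        return max(self.candidates, key=self.candidates.get)


if __name__ == "__main__":
    interp = IncrementalInterpreter()
    lexicon = {"red": {"cup": 0.7, "book": 0.2, "lamp": 0.1}}
    placements = {"cup": (0.2, 0.1), "book": (0.8, 0.4), "lamp": (0.5, 0.9)}
    interp.update_speech("the", lexicon)             # uninformative word: no change
    interp.update_speech("red", lexicon)             # speech evidence arrives
    interp.update_gesture((0.25, 0.15), placements)  # hand near the cup's placement
    print(interp.best(), interp.candidates)
```

Because the distribution is updated after every word and gesture increment, a confident (and, when the modalities agree, earlier) interpretation can be read off before the utterance is complete, which is the property the paper exploits for incremental processing.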
