Beyond attention: the role of deictic gesture in intention recognition in multimodal conversational interfaces

In a multimodal conversational interface supporting speech and deictic gesture, deictic gestures on the graphical display have traditionally been used to identify user attention, for example, through reference resolution. Since the context of the identified attention can potentially constrain the associated intention, our hypothesis is that deictic gestures can go beyond attention and contribute to intention recognition. Motivated by this hypothesis, this paper systematically investigates the role of deictic gestures in intention recognition. We experiment with different model-based and instance-based methods for incorporating gestural information into intention recognition, and we examine the effects of utilizing gestural information at two different processing stages: the speech recognition stage and the language understanding stage. Our empirical results show that utilizing gestural information improves intention recognition, and that performance improves further when gestures are incorporated in both the speech recognition and language understanding stages rather than in either stage alone.
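To make the idea concrete, the sketch below shows one plausible way gestural information could feed an intention classifier at the language understanding stage: n-gram features from the recognized utterance are concatenated with salience features derived from recently gestured display objects, and a linear SVM predicts the intent. This is a minimal illustration, not the paper's implementation; the object types, intent labels, salience scores, and toy data are all hypothetical.

```python
# Illustrative sketch (assumed setup, not the paper's system): an intention
# classifier over combined speech and gesture features.
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

OBJECT_TYPES = ["house", "street", "region"]          # hypothetical display object types

def gesture_matrix(samples):
    """Salience of gestured objects, one column per object type."""
    return np.array([[s["gesture_salience"].get(t, 0.0) for t in OBJECT_TYPES]
                     for s in samples])

train = [  # toy training data (hypothetical)
    {"asr": "how much is this one",  "gesture_salience": {"house": 0.9}},
    {"asr": "show houses near here", "gesture_salience": {"region": 0.8}},
]
labels = ["ask_price", "filter_items"]                # hypothetical intent labels

# Speech branch: bag-of-n-grams over the recognized hypothesis.
vec = CountVectorizer(ngram_range=(1, 2))
X_speech = vec.fit_transform(s["asr"] for s in train)

# Combine speech features with gesture-derived salience features.
X_train = hstack([X_speech, gesture_matrix(train)])

clf = LinearSVC().fit(X_train, labels)

test = [{"asr": "what does it cost", "gesture_salience": {"house": 0.7}}]
X_test = hstack([vec.transform(s["asr"] for s in test), gesture_matrix(test)])
print(clf.predict(X_test))                            # predicted intent for the test input
```

A similar gesture-derived salience score could, in principle, also be used at the speech recognition stage, for example to rescore N-best hypotheses so that hypotheses consistent with the gestured objects are preferred.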
