Contextual recognition of head gestures

Head pose and gesture offer several key conversational grounding cues and are used extensively in face-to-face interaction among people. We investigate how dialog context from an embodied conversational agent (ECA) can improve visual recognition of user gestures. We present a recognition framework that (1) extracts contextual features from an ECA's dialog manager, (2) computes a prediction of head nods and head shakes, and (3) integrates the contextual predictions with the visual observations of a vision-based head gesture recognizer. We identified a subset of lexical, punctuation, and timing features that are easily available in most ECA architectures and can be used to learn how to predict user feedback. Using a discriminative approach to contextual prediction and multi-modal integration, we were able to improve the performance of head gesture detection even when the topic of the test set differed significantly from that of the training set.
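The abstract describes a late-fusion architecture: contextual features from the dialog manager are combined with the output of a vision-based gesture recognizer through a discriminative classifier. The sketch below illustrates one plausible realization of that idea; the feature encoding, the logistic-regression choice, and the toy data are assumptions for illustration, not the authors' actual implementation.

```python
# Minimal sketch (assumed, not the authors' code): discriminative late fusion
# of dialog-context features with a vision-based head-gesture recognizer.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical contextual features per agent utterance: a lexical cue
# (e.g. a yes/no question), end-of-utterance punctuation, and elapsed
# time (seconds) since the utterance ended.
contextual = np.array([
    [1, 1, 0.2],
    [0, 0, 1.5],
    [1, 0, 0.4],
    [0, 1, 2.0],
])

# Scores from the vision-based head nod / head shake recognizer,
# aggregated over the same time windows (dummy values).
vision_scores = np.array([
    [0.7],
    [0.3],
    [0.6],
    [0.2],
])

# Ground truth: 1 = user produced a head nod/shake, 0 = no gesture.
labels = np.array([1, 0, 1, 0])

# Multi-modal integration: concatenate contextual and visual evidence and
# train a discriminative classifier on the joint feature vector.
X = np.hstack([contextual, vision_scores])
clf = LogisticRegression().fit(X, labels)

# At run time, the fused classifier re-scores the recognizer's hypotheses
# using dialog context.
new_window = np.array([[1, 1, 0.3, 0.55]])
print(clf.predict_proba(new_window)[0, 1])  # probability of a head gesture
```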
