Inferring body pose using speech content

Untethered multimodal interfaces are more attractive than tethered ones because they allow more natural and expressive interaction. Such interfaces usually require robust vision-based body pose estimation and gesture recognition. In interfaces where a user interacts with a computer through speech and arm gestures, the user's spoken keywords can be recognized in conjunction with hypotheses of body poses. This co-occurrence can reduce the number of body pose hypotheses that the vision-based tracker must consider. In this paper we show that incorporating speech-based body pose constraints can increase the robustness and accuracy of vision-based tracking systems. Next, we describe an approach for gesture recognition: we show how Linear Discriminant Analysis (LDA) can be employed to estimate "good features" for use in a standard HMM-based gesture recognition system, and that applying our LDA scheme significantly reduces recognition errors compared with a standard HMM-based technique. We applied both techniques in a Virtual Home Desktop scenario. Experiments in which users controlled a desktop system using gestures and speech show that speech recognized in conjunction with body poses increased the accuracy of the vision-based tracking system.
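The core idea of using spoken keywords to constrain the tracker can be sketched as a simple pruning step. The keyword names, pose labels, and mapping below are illustrative assumptions, not the paper's actual vocabulary:

```python
# Hypothetical sketch: a recognized spoken keyword constrains which
# body-pose hypotheses the vision-based tracker must consider.
# Keywords and pose labels are invented for illustration.

# Map each spoken keyword to the pose classes it is compatible with.
KEYWORD_TO_POSES = {
    "select": {"arm_extended", "pointing"},
    "scroll": {"arm_raised", "waving"},
    "stop":   {"arm_raised"},
}

def prune_hypotheses(hypotheses, keyword):
    """Keep only pose hypotheses consistent with the recognized keyword.

    hypotheses: list of (pose_label, likelihood) pairs from the tracker.
    Falls back to the full hypothesis set when the keyword is unknown,
    so a speech-recognition miss never starves the tracker.
    """
    allowed = KEYWORD_TO_POSES.get(keyword)
    if allowed is None:
        return hypotheses
    return [(pose, p) for pose, p in hypotheses if pose in allowed]

hyps = [("pointing", 0.4), ("resting", 0.35), ("arm_raised", 0.25)]
print(prune_hypotheses(hyps, "select"))  # only "pointing" survives
```

The fallback on unknown keywords reflects the abstract's framing: speech is a constraint that shrinks the hypothesis space when available, not a hard requirement for tracking.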
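The LDA feature step can likewise be sketched as a classical Fisher discriminant projection computed with NumPy. This is an assumed reconstruction of the generic technique, not the paper's implementation; the projected features would then feed an HMM classifier:

```python
import numpy as np

def lda_projection(X, y, n_components):
    """Fisher LDA: return a projection matrix maximizing between-class
    scatter relative to within-class scatter (generic sketch, not the
    paper's exact scheme).

    X: (n_samples, n_features) gesture feature vectors.
    y: (n_samples,) integer class labels.
    """
    classes = np.unique(y)
    mean = X.mean(axis=0)
    n_features = X.shape[1]
    Sw = np.zeros((n_features, n_features))  # within-class scatter
    Sb = np.zeros((n_features, n_features))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - mean).reshape(-1, 1)
        Sb += len(Xc) * (d @ d.T)
    # Solve the generalized eigenproblem Sb w = lambda Sw w via pinv(Sw) Sb.
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[:n_components]]
```

In a pipeline like the one described, `X @ lda_projection(X, y, k)` would supply the low-dimensional "good features" on which a standard HMM-based gesture recognizer is trained.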
