Inferring body pose using speech content

Untethered multimodal interfaces are more attractive than tethered ones because they allow more natural and expressive interaction. Such interfaces usually require robust vision-based body pose estimation and gesture recognition. In interfaces where a user interacts with a computer through speech and arm gestures, the user's spoken keywords can be recognized in conjunction with hypotheses of body poses. This co-occurrence can reduce the number of body pose hypotheses that the vision-based tracker must consider. In this paper we show that incorporating speech-based body pose constraints can increase the robustness and accuracy of vision-based tracking systems. Next, we describe an approach for gesture recognition: we show how Linear Discriminant Analysis (LDA) can be employed to estimate "good features" for use in a standard HMM-based gesture recognition system, and that applying our LDA scheme significantly reduces recognition errors compared with a standard HMM-based technique. We applied both techniques in a Virtual Home Desktop scenario. Experiments in which users controlled a desktop system using gestures and speech show that speech recognized in conjunction with body poses increased the accuracy of the vision-based tracking system.
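The core idea of using spoken keywords to constrain the tracker can be sketched as a simple pruning step. The keyword names, pose labels, and mapping below are illustrative assumptions, not the paper's actual vocabulary:

```python
# Hypothetical sketch: a recognized spoken keyword constrains which
# body-pose hypotheses the vision-based tracker must consider.
# Keywords and pose labels are invented for illustration.

# Map each spoken keyword to the pose classes it is compatible with.
KEYWORD_TO_POSES = {
    "select": {"arm_extended", "pointing"},
    "scroll": {"arm_raised", "waving"},
    "stop":   {"arm_raised"},
}

def prune_hypotheses(hypotheses, keyword):
    """Keep only pose hypotheses consistent with the recognized keyword.

    hypotheses: list of (pose_label, likelihood) pairs from the tracker.
    Falls back to the full hypothesis set when the keyword is unknown,
    so a speech-recognition miss never starves the tracker.
    """
    allowed = KEYWORD_TO_POSES.get(keyword)
    if allowed is None:
        return hypotheses
    return [(pose, p) for pose, p in hypotheses if pose in allowed]

hyps = [("pointing", 0.4), ("resting", 0.35), ("arm_raised", 0.25)]
print(prune_hypotheses(hyps, "select"))  # only "pointing" survives
```

The fallback on unknown keywords reflects the abstract's framing: speech is a constraint that shrinks the hypothesis space when available, not a hard requirement for tracking.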
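The LDA feature step can likewise be sketched as a classical Fisher discriminant projection computed with NumPy. This is an assumed reconstruction of the generic technique, not the paper's implementation; the projected features would then feed an HMM classifier:

```python
import numpy as np

def lda_projection(X, y, n_components):
    """Fisher LDA: return a projection matrix maximizing between-class
    scatter relative to within-class scatter (generic sketch, not the
    paper's exact scheme).

    X: (n_samples, n_features) gesture feature vectors.
    y: (n_samples,) integer class labels.
    """
    classes = np.unique(y)
    mean = X.mean(axis=0)
    n_features = X.shape[1]
    Sw = np.zeros((n_features, n_features))  # within-class scatter
    Sb = np.zeros((n_features, n_features))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - mean).reshape(-1, 1)
        Sb += len(Xc) * (d @ d.T)
    # Solve the generalized eigenproblem Sb w = lambda Sw w via pinv(Sw) Sb.
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[:n_components]]
```

In a pipeline like the one described, `X @ lda_projection(X, y, k)` would supply the low-dimensional "good features" on which a standard HMM-based gesture recognizer is trained.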
