Improving continuous gesture recognition with spoken prosody

Despite recent advances in gesture recognition, reliance on the visual signal alone to classify unrestricted continuous gesticulation is inherently error-prone. Since spontaneous gesticulation is mostly coverbal in nature, some attempts have been made to use speech cues to improve gesture recognition, e.g., keyword-gesture co-analysis. Such schemes, however, are burdened by the complexity of natural language understanding. This paper offers a "signal-level" perspective by exploring prosodic phenomena of spontaneous gesture and speech co-production. We present a computational framework for improving continuous gesture recognition based on two phenomena that capture voluntary (co-articulation) and involuntary (physiological) contributions to prosodic synchronization. Physiological constraints, manifested as signal interruptions in multimodal production, are exploited in an audio-visual feature integration framework using hidden Markov models (HMMs). Co-articulation is analyzed using a Bayesian network of naive classifiers to explore the alignment of intonationally prominent speech segments with hand kinematics. The efficacy of the proposed approach was demonstrated on a multimodal corpus created from Weather Channel broadcasts. Both schemes were found to contribute uniquely by reducing different error types, which in turn improves the performance of continuous gesture recognition.
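
To make the two mechanisms concrete, the sketch below illustrates the first one: feature-level audio-visual integration with HMMs, where each frame concatenates hand kinematics with prosodic features so that joint signal interruptions (pauses and holds) leave a signature the state model can capture. The hmmlearn library, the specific features, and the function names are assumptions for illustration; the abstract does not specify the paper's actual front-end or model topology.

```python
# A minimal sketch of feature-level audio-visual integration with HMMs,
# assuming hand kinematics and prosodic features have already been
# extracted and time-aligned per frame. The hmmlearn library, the
# feature layout, and all names below are illustrative assumptions,
# not the authors' implementation.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_gesture_hmm(sequences, n_states=4):
    """Fit one HMM to a gesture class from fused audio-visual frames.

    sequences: list of (T_i, D) arrays; each row concatenates visual
    features (e.g., hand position and velocity) with prosodic features
    (e.g., F0, energy). Pauses and holds leave a joint "interruption"
    signature in both channels that the state model can capture.
    """
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    return model

def classify_gesture(models, sequence):
    """Return the gesture label whose HMM scores the sequence highest."""
    return max(models, key=lambda label: models[label].score(sequence))
```

The second mechanism, co-articulation analysis, can be sketched analogously: a naive classifier scores whether an intonationally prominent speech segment aligns with a gesture stroke. GaussianNB stands in for one component classifier of the Bayesian network; the per-segment features and toy values below are hypothetical.

```python
# A minimal sketch of the co-articulation co-analysis, assuming
# intonationally prominent speech segments have been detected and hand
# kinematics summarized around each one. Features and toy values are
# hypothetical illustrations only.
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Per prominent segment: [F0 rise, RMS energy, hand speed at the
# prominence, lag to the nearest kinematic peak in ms] (toy values).
X = np.array([
    [4.2, 0.8, 31.0,  40.0],   # prominence aligned with a stroke
    [0.3, 0.2,  2.0, 400.0],   # no real prominence, hands near rest
    [3.9, 0.7, 28.0,  60.0],
    [0.5, 0.1,  1.5, 350.0],
])
y = np.array([1, 0, 1, 0])      # 1 = co-articulated gesture stroke

clf = GaussianNB().fit(X, y)
# Posterior that a new segment co-occurs with a stroke; such scores
# could re-rank or gate the hypotheses from the visual HMMs.
print(clf.predict_proba([[3.5, 0.6, 25.0, 80.0]])[0, 1])
```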
