Dynamic visual features based on discriminative speech class projection for visual speech recognition

This paper presents a dynamic visual feature extraction scheme to capture important lip motion information for visual speech recognition. Discriminative projections based on a priori chosen speech classes (phonemes and visemes) are applied to the concatenation of pre-extracted static visual features. First- and second-order temporal derivatives are subsequently extracted to further represent the dynamic differences. Experiments on a connected-digits task demonstrate that the proposed highly discriminative dynamic features, when appended to the static features, yield superior recognition performance. Compared to the commonly used delta and acceleration features, the proposed dynamic features lead to an 8% absolute improvement in word accuracy on the considered recognition task.
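The abstract leaves the exact form of the discriminative projection unspecified; the sketch below is only an illustration of the described pipeline, assuming an LDA-style projection trained on frame-level phoneme/viseme labels, NumPy/scikit-learn, and toy stand-in data (all names and parameters here are hypothetical, not the authors' implementation).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def stack_frames(feats, context=3):
    """Concatenate each frame with its +/- `context` neighbours (edge-padded)."""
    T = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])


def deltas(x):
    """First-order temporal derivative via central differences."""
    return np.gradient(x, axis=0)


# Toy data standing in for real inputs (dimensions are illustrative).
T, D, n_classes = 500, 30, 13                   # frames, static feature dim, speech classes
rng = np.random.default_rng(0)
static_feats = rng.standard_normal((T, D))      # pre-extracted static visual features
labels = rng.integers(0, n_classes, size=T)     # frame-level phoneme/viseme class labels

# Discriminative projection of concatenated static features
# (LDA used here as a stand-in for the class-based projection).
stacked = stack_frames(static_feats, context=3)
lda = LinearDiscriminantAnalysis(n_components=n_classes - 1)
dynamic = lda.fit_transform(stacked, labels)

# First- and second-order temporal derivatives of the projected stream.
d1 = deltas(dynamic)
d2 = deltas(d1)

# Final observation vector: static features augmented with the dynamic ones.
features = np.hstack([static_feats, dynamic, d1, d2])
print(features.shape)
```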