Paper information - 3D Lip Tracking and Co-inertia Analysis for Improved Robustness of Audio-video Automatic Speech Recognition

Multimodality is a key issue in robust human-computer interaction. The joint use of audio and video speech variables has been shown to improve the performance of automatic speech recognition (ASR) systems. However, robust methods, in particular for the real-time extraction of video speech features, remain an open research area. This paper addresses the robustness of audio-video (AV) ASR systems in two ways: by exploring a real-time 3D lip tracking algorithm based on stereo vision, and by investigating how learned statistical relationships between the sets of audio and video speech variables can be employed in AV ASR systems. The 3D lip tracking algorithm combines colour information from each camera's images with knowledge about the structure of the mouth region for different degrees of mouth openness. With a calibrated stereo camera system, the 3D coordinates of facial features can be recovered, so that the visual speech variable measurements become independent of head pose. Multivariate statistical analyses make it possible to study relationships between sets of variables. Co-inertia analysis is a relatively new method and has not yet been widely used in AVSP research. Its advantage over other multivariate methods is superior numerical stability when the sample size is small. Initial results are presented, which show how 3D video speech information and learned statistical relationships between audio and video speech variables can improve the performance of AV ASR systems.
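The claim that 3D coordinates make visual speech measurements independent of head pose can be illustrated with a minimal sketch. It is not the paper's tracking algorithm. It assumes an idealised, rectified pinhole stereo pair, and every name and value in it (`triangulate`, `project`, `lip_corners`, the focal length, the baseline, the 50 mm mouth width) is hypothetical and chosen only for illustration.

```python
import math

# Hypothetical camera parameters for an ideal rectified stereo pair (assumed values).
CAM = dict(focal_px=800.0, baseline_mm=120.0, cx=320.0, cy=240.0)


def triangulate(u_left, u_right, v, focal_px, baseline_mm, cx, cy):
    """Recover (X, Y, Z) in mm, in the left-camera frame, from one stereo match."""
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = focal_px * baseline_mm / disparity
    x = (u_left - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return (x, y, z)


def project(point, focal_px, baseline_mm, cx, cy):
    """Project a 3D point into both images; used here only to synthesise test data."""
    x, y, z = point
    u_left = focal_px * x / z + cx
    u_right = focal_px * (x - baseline_mm) / z + cx
    v = focal_px * y / z + cy
    return u_left, u_right, v


def lip_corners(yaw_deg, width_mm=50.0, depth_mm=600.0):
    """Hypothetical lip corners for a head turned by yaw_deg about the vertical axis."""
    half = width_mm / 2.0
    yaw = math.radians(yaw_deg)
    dx, dz = half * math.cos(yaw), half * math.sin(yaw)
    return (-dx, 0.0, depth_mm - dz), (dx, 0.0, depth_mm + dz)


def measure(yaw_deg):
    """Return (mouth width in left-image pixels, mouth width in mm from 3D points)."""
    a, b = lip_corners(yaw_deg)
    pa, pb = project(a, **CAM), project(b, **CAM)
    width_2d = abs(pb[0] - pa[0])
    width_3d = math.dist(triangulate(*pa, **CAM), triangulate(*pb, **CAM))
    return width_2d, width_3d


if __name__ == "__main__":
    for yaw in (0, 15, 30):
        w2d, w3d = measure(yaw)
        print(f"yaw={yaw:2d} deg  2D width={w2d:.2f} px  3D width={w3d:.2f} mm")
```

In this toy setup the pixel width of the mouth shrinks as the head turns, while the width computed from the triangulated 3D points stays at the assumed 50 mm. A real system would also have to handle calibration error, lens distortion, and feature-matching noise, none of which is modelled here.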

Roland Göcke
