3D Lip Tracking and Co-inertia Analysis for Improved Robustness of Audio-video Automatic Speech Recognition

Multimodality is a key issue in robust human-computer interaction. The joint use of audio and video speech variables has been shown to improve the performance of automatic speech recognition (ASR) systems. However, robust methods, in particular for the real-time extraction of video speech features, are still an open research area. This paper addresses the robustness issue of audio-video (AV) ASR systems by exploring a real-time 3D lip tracking algorithm based on stereo vision and by investigating how learned statistical relationships between the sets of audio and video speech variables can be employed in AV ASR systems. The 3D lip tracking algorithm combines colour information from each camera's image with knowledge about the structure of the mouth region for different degrees of mouth openness. By using a calibrated stereo camera system, the 3D coordinates of facial features can be recovered, so that the visual speech variable measurements become independent of the head pose. Multivariate statistical methods enable the analysis of relationships between sets of variables. Co-inertia analysis is a relatively new such method and has not yet been widely used in AVSP research. Its advantage over other multivariate methods is its superior numerical stability when the sample size is small. Initial results are presented, which show how 3D video speech information and learned statistical relationships between audio and video speech variables can improve the performance of AV ASR systems.
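As a rough illustration of the kind of co-inertia analysis described above, the sketch below computes pairs of co-inertia axes for two centred data matrices (rows = shared observations such as time frames; columns = audio or video speech variables) via an SVD of their cross-covariance matrix. This is a minimal, identity-weighted variant written for illustration; the function name, arguments, and the synthetic data are assumptions, not the paper's implementation.

```python
import numpy as np

def coinertia(X, Y, n_axes=2):
    """Minimal co-inertia sketch: X is (n, p), Y is (n, q),
    rows are paired observations of the two variable sets."""
    # Centre each variable set.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Cross-covariance matrix between the two sets.
    C = Xc.T @ Yc / X.shape[0]
    # Singular vectors of C define paired co-inertia axes that
    # maximise the covariance between the projected audio and
    # video scores; singular values measure the shared inertia.
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :n_axes], Vt[:n_axes].T, s[:n_axes]

# Hypothetical example: two feature sets driven by one shared factor.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))                      # shared latent signal
X = np.hstack([z + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 2))])          # "audio" variables
Y = np.hstack([z + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 3))])          # "video" variables
A, B, s = coinertia(X, Y)
# Scores on the first axis pair should be strongly correlated,
# since both sets contain the same latent signal.
sx = (X - X.mean(0)) @ A[:, 0]
sy = (Y - Y.mean(0)) @ B[:, 0]
```

In an AV ASR setting, the learned axis pairs can be used to map between audio and video feature spaces, e.g. to constrain or reconstruct noisy measurements in one modality from the other.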