The fusion of audio and visual speech is an instance of the general sensory fusion problem. The sensory fusion problem arises when multiple channels carry complementary information about different components of a system. In the case of audio-visual speech, the two modalities manifest two aspects of the same underlying speech production process. From an observer's point of view, the audio channel and the visual channel represent two interacting stochastic processes. We seek a framework that can model the two individual processes as well as their dynamic interactions. One interesting aspect of audio-visual speech is the inherent asynchrony between the audio and visual channels. Most early integration approaches to the fusion problem assume tight synchrony between the two. However, studies have shown that human perception of bimodal speech does not require rigid synchronization of the two modalities. Furthermore, humans appear to use audio-visual asynchronies themselves as multimodal features. For example, it is well known that voice onset time is an important cue to the voicing feature in stop consonants. This information can be conveyed bimodally by the interval between seeing the stop release and hearing the vocal cord vibration. Therefore, a successful fusion scheme should not only tolerate asynchrony between the audio and visual cues, but also be able to capture and exploit such bimodal features.
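To make the modeling requirement concrete, the following is a minimal sketch (not the method proposed here) of a coupled-HMM-style forward pass: each stream keeps its own hidden state, so the audio and visual chains may drift out of synchrony, while each chain's transition is conditioned on both previous states, capturing their dynamic interaction. All state counts, parameters, and observation likelihoods below are placeholder assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

n_audio, n_visual = 3, 2   # hidden states per stream (assumed sizes)
T = 5                      # number of observation frames (assumed)

def normalize(x, axis=-1):
    return x / x.sum(axis=axis, keepdims=True)

# Coupled transitions: P(a_t | a_{t-1}, v_{t-1}) and P(v_t | a_{t-1}, v_{t-1}).
# Each stream's next state depends on the previous states of BOTH streams.
A_audio = normalize(rng.random((n_audio, n_visual, n_audio)))
A_visual = normalize(rng.random((n_audio, n_visual, n_visual)))

# Per-frame, per-stream observation likelihoods (placeholders standing in for
# acoustic- and lip-feature likelihoods from, e.g., Gaussian mixtures).
lik_audio = rng.random((T, n_audio))
lik_visual = rng.random((T, n_visual))

# Joint forward variable alpha[a, v] over (audio state, visual state).
alpha = normalize(rng.random((n_audio, n_visual)), axis=None)
alpha = alpha * np.outer(lik_audio[0], lik_visual[0])
log_evidence = np.log(alpha.sum())
alpha /= alpha.sum()

for t in range(1, T):
    # Predict: sum over previous (a, v) of alpha[a,v] * P(b|a,v) * P(c|a,v).
    pred = np.einsum("av,avb,avc->bc", alpha, A_audio, A_visual)
    # Update with the two streams' likelihoods for frame t.
    alpha = pred * lik_audio[t][:, None] * lik_visual[t][None, :]
    log_evidence += np.log(alpha.sum())
    alpha /= alpha.sum()

# The per-stream marginals need not peak at "aligned" states, which is how
# this family of models tolerates local audio-visual asynchrony.
print("audio state posterior :", alpha.sum(axis=1))
print("visual state posterior:", alpha.sum(axis=0))
print("log evidence          :", log_evidence)
```

Because the joint state space is the product of the two chains, the streams are free to occupy different stages of the same utterance at a given frame, while the coupled transition matrices still let one stream's progress influence the other's.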