In this paper, we present a bimodal speech recognition system in which the audio and visual modalities are modeled and integrated using coupled hidden Markov models (CHMMs). CHMMs are probabilistic inference graphs that have hidden Markov models as sub-graphs. Chains in the corresponding inference graph are coupled through matrices of conditional probabilities that model temporal influences between their hidden state variables. The coupling probabilities are both cross-chain and cross-time. The latter is essential for allowing temporal influences between chains, which is important in modeling bimodal speech. Our bimodal speech recognition system employs a two-chain CHMM, with one chain associated with the acoustic observations and the other with the visual features. A deterministic approximation for maximum a posteriori (MAP) estimation is used to enable fast classification and parameter estimation. We evaluated the system on a speaker-independent connected-digit task. Compared with an acoustic-only ASR system trained using only the audio channel of the same database, the bimodal system consistently demonstrates improved noise robustness at all SNRs. We further compare the CHMM system reported in this paper with our earlier bimodal speech recognition system, in which the two modalities are fused by concatenating the audio and visual features. The recognition results clearly show the advantages of the CHMM framework in the context of bimodal speech recognition.
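To make the cross-chain, cross-time coupling concrete, the following is a minimal sketch of likelihood evaluation in a two-chain CHMM. It is illustrative only: it uses the exact forward recursion over the joint (audio, visual) state space, not the deterministic MAP approximation the abstract refers to. The function name `chmm_forward_loglik`, the array layouts, and the assumption that the initial state distribution factorizes across the two chains are all hypothetical choices made for this example.

```python
import numpy as np


def chmm_forward_loglik(pi_a, pi_v, A_a, A_v, B_a, B_v):
    """Log-likelihood of one observation sequence under a two-chain CHMM.

    Assumed (hypothetical) layouts:
      pi_a : (Na,)         initial probabilities of the audio chain
      pi_v : (Nv,)         initial probabilities of the visual chain
      A_a  : (Na, Nv, Na)  A_a[i, j, k] = P(a_t = k | a_{t-1} = i, v_{t-1} = j)
      A_v  : (Na, Nv, Nv)  A_v[i, j, l] = P(v_t = l | a_{t-1} = i, v_{t-1} = j)
      B_a  : (T, Na)       likelihood of the audio frame at time t per audio state
      B_v  : (T, Nv)       likelihood of the visual frame at time t per visual state
    """
    T = B_a.shape[0]

    # Joint forward variable over (audio state, visual state) at t = 0,
    # assuming the initial distribution factorizes across chains.
    alpha = np.outer(pi_a * B_a[0], pi_v * B_v[0])
    scale = alpha.sum()
    loglik = np.log(scale)
    alpha /= scale

    for t in range(1, T):
        # Each chain's next state depends on BOTH chains' previous states;
        # given those, the two next states are conditionally independent.
        pred = np.einsum("ij,ijk,ijl->kl", alpha, A_a, A_v)
        alpha = pred * np.outer(B_a[t], B_v[t])

        # Rescale at every step to avoid numerical underflow.
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha /= scale

    return loglik
```

The exact recursion costs on the order of (Na·Nv)² operations per frame, which grows quickly with the number of states per chain; this is one reason an approximate inference scheme is attractive for fast classification.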