Word spotting is the task of detecting and verifying specific words embedded in unconstrained speech. Most word spotters based on hidden Markov models (HMMs) suffer from the same noise-robustness problem as speech recognizers: their performance drops significantly in noisy environments. Visual speech information has been shown to improve the noise robustness of speech recognizers (Neti, C. et al., 2000; Nefian, A.V. et al., 2002; Potamianos, G. et al., 2003). We add visual speech information to improve the noise robustness of the word spotter. In visual front-end processing, the information-based maximum discrimination (IBMD) algorithm (Colmenarez, A. and Huang, T.S., 1997) is used to detect the face/mouth corners. For audio-visual fusion, feature-level fusion is adopted. We compare the audio-visual word spotter with the audio-only spotter and show the advantage of the former approach over the latter.
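Feature-level (early) fusion, as adopted here, concatenates the per-frame audio and visual feature vectors into a single joint observation vector before it is passed to the HMM. A minimal sketch, assuming frame-rate-aligned streams (the function name, dimensions, and feature choices are illustrative, not the paper's exact configuration):

```python
import numpy as np

def fuse_features(audio_feats: np.ndarray, visual_feats: np.ndarray) -> np.ndarray:
    """Feature-level fusion: concatenate per-frame audio and visual
    feature vectors into one joint observation vector per frame.

    audio_feats:  (T, Da) array, e.g. MFCCs with deltas per frame
    visual_feats: (T, Dv) array, e.g. mouth-region features per frame
    The two streams are assumed already synchronized to T frames.
    """
    if audio_feats.shape[0] != visual_feats.shape[0]:
        raise ValueError("audio and visual streams must have equal frame counts")
    return np.hstack([audio_feats, visual_feats])

# Example: 100 frames of 39-dim audio and 10-dim visual features
audio = np.zeros((100, 39))
visual = np.zeros((100, 10))
joint = fuse_features(audio, visual)
print(joint.shape)  # (100, 49)
```

The resulting joint vectors can then train a single-stream HMM, in contrast to decision-level fusion, where separate audio and visual models are combined only at the score level.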
[1] Kevin P. Murphy, et al., "A coupled HMM for audio-visual speech recognition," 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002.
[2] Thomas S. Huang, et al., "Face detection with information-based maximum discrimination," Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997.
[3] Chalapathy Neti, et al., "Recent advances in the automatic recognition of audiovisual speech," Proc. IEEE, 2003.
[4] Thomas S. Huang, et al., "Maximum likelihood face detection," Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, 1996.
[5] Yoni Bauduin, et al., "Audio-Visual Speech Recognition," 2004.