Audio-visual feature integration based on piecewise linear transformation for noise robust automatic speech recognition

Multimodal speech recognition is a promising approach to noise robust automatic speech recognition (ASR) and is currently attracting the attention of many researchers. Multimodal ASR utilizes not only audio features, which are sensitive to background noise, but also non-audio features such as lip shapes to achieve noise robustness. Although various methods have been proposed to integrate audio-visual features, how the best integration of audio and visual features can be realized is still under discussion. The weights of the audio and visual features should be decided according to the noise type and level: in general, audio features should receive larger weights when the noise level is low, and visual features larger weights when it is high. The question is how this weighting can be controlled. In this paper, we propose a feature integration method based on piecewise linear transformation. In contrast to other feature integration methods, the proposed method can appropriately change the weights depending on the state of the observed noisy feature, which carries information on both the uttered phonemes and the environmental noise. Experiments on noisy speech recognition are conducted following the CENSREC-1-AV protocol, and an average word error reduction rate of around 24% is achieved compared with a decision fusion method.
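To make the idea of state-dependent piecewise linear integration concrete, below is a minimal, hypothetical Python sketch. A diagonal-covariance GMM over the noisy audio feature selects among K linear regions, and each region applies its own affine transform to the concatenated audio-visual vector; the region posteriors softly blend the transforms, so the effective audio/visual weighting changes with the observed noisy feature. All dimensions, parameter values, and the soft posterior-weighted combination are illustrative assumptions, not the paper's exact formulation.

import numpy as np

rng = np.random.default_rng(0)

K = 4               # number of linear regions (GMM components); illustrative
D_A, D_V = 39, 30   # audio / visual feature dimensions; illustrative
D = D_A + D_V       # dimension of the concatenated audio-visual vector

# Region model: diagonal-covariance GMM over the noisy audio feature.
# In practice these parameters would be trained; here they are random.
weights = np.full(K, 1.0 / K)
means = rng.normal(size=(K, D_A))
variances = np.ones((K, D_A))

# One affine transform per region, mapping the concatenated
# audio-visual vector to an integrated feature (kept at D_A dims here).
W = rng.normal(scale=0.1, size=(K, D_A, D))
b = np.zeros((K, D_A))

def log_gauss_diag(x, mu, var):
    """Log density of a diagonal Gaussian, broadcast over components."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=-1)

def integrate(x_audio, x_visual):
    """Softly piecewise-linear integration of one audio-visual frame."""
    x = np.concatenate([x_audio, x_visual])
    # Posterior over regions given the noisy audio observation:
    # this is where the noise state influences the integration.
    log_p = np.log(weights) + log_gauss_diag(x_audio, means, variances)
    post = np.exp(log_p - log_p.max())
    post /= post.sum()
    # Posterior-weighted sum of the region-specific affine transforms.
    return np.einsum('k,kij,j->i', post, W, x) + post @ b

frame = integrate(rng.normal(size=D_A), rng.normal(size=D_V))
print(frame.shape)  # (39,)

Because the region posterior is computed from the noisy audio feature itself, a frame observed under heavy noise falls into regions whose transforms (after training) lean more on the visual part of the concatenated vector, which is one plausible way the adaptive weighting described above could emerge.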
