Dominant speaker detection based on voicing for adaptive audio-visual ASR robust to speech noise

We investigate the use of voicing in state-of-the-art large-vocabulary continuous audio-visual automatic speech recognition (AV-LVCSR). We apply an original adaptive weighting function that uses the voicing level to estimate appropriate combination weights for each modality. We show that state-of-the-art AV-LVCSR performance under speech noise can be improved by a dominant-speaker detector that is a function of the voicing level. We refine the weighting function according to the sensitivity and specificity of this detector. In this first experiment, the weighting functions are threshold functions of the voicing level. Rather than testing all possible thresholds, we arbitrarily choose three: one at which the detector's sensitivity reaches 95%, one at which its specificity reaches 95%, and one at which sensitivity and specificity are equal. Results show that the AV-LVCSR system we use is improved by 5.7% with a weighting function that has high sensitivity to dominant-speaker activity.
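As an illustration of the threshold selection described above, the following is a minimal sketch and not the authors' implementation. It assumes a per-frame voicing level and a reference label (1 when the dominant speaker is active, 0 otherwise), with both classes present in the data. The function names, the toy data, and the weight values 0.7 and 0.3 are hypothetical; the abstract does not state the actual stream weights.

```python
import math


def sensitivity_specificity(voicing, labels, threshold):
    """Detector fires (dominant speaker assumed active) when voicing >= threshold."""
    pairs = list(zip(voicing, labels))
    tp = sum(1 for v, y in pairs if y == 1 and v >= threshold)
    fn = sum(1 for v, y in pairs if y == 1 and v < threshold)
    tn = sum(1 for v, y in pairs if y == 0 and v < threshold)
    fp = sum(1 for v, y in pairs if y == 0 and v >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def pick_thresholds(voicing, labels, target=0.95):
    """Return the three operating points named in the abstract."""
    # Candidate thresholds: every observed voicing level, plus +inf so that
    # a threshold with perfect specificity always exists.
    candidates = sorted(set(voicing)) + [math.inf]
    stats = [(t, *sensitivity_specificity(voicing, labels, t)) for t in candidates]
    # Highest threshold that still keeps sensitivity at or above the target.
    high_sens = max(t for t, se, sp in stats if se >= target)
    # Lowest threshold that brings specificity to or above the target.
    high_spec = min(t for t, se, sp in stats if sp >= target)
    # Threshold where sensitivity and specificity are closest to equal.
    equal = min(stats, key=lambda s: abs(s[1] - s[2]))[0]
    return {"high_sensitivity": high_sens,
            "high_specificity": high_spec,
            "equal": equal}


def audio_weight(voicing_level, threshold, w_dominant=0.7, w_other=0.3):
    """Threshold weighting function (hypothetical weight values).

    Returns the audio stream weight; the visual stream would get 1 - weight.
    """
    return w_dominant if voicing_level >= threshold else w_other


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    voicing = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    labels = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
    thresholds = pick_thresholds(voicing, labels)
    print(thresholds)
    print(audio_weight(0.45, thresholds["high_sensitivity"]))
```

A lower threshold makes the detector fire more often, which raises sensitivity at the cost of specificity; this is why the high-sensitivity operating point is the highest threshold that still meets the 95% target.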
