Stream confidence estimation for audio-visual speech recognition

We investigate the use of single-modality confidence measures as a means of estimating adaptive, local weights for improved audio-visual automatic speech recognition. We limit our work to the toy problem of audio-visual phonetic classification by means of a two-stream Gaussian mixture model (GMM), where each stream models the class-conditional audio- or visual-only observation probability, raised to an appropriate exponent. We consider such stream exponents as two-dimensional piecewise constant functions of the audio and visual stream local confidences, and we estimate them by minimizing the misclassification error on a held-out data set. Three stream confidence measures are investigated, namely the stream entropy, the n-best likelihood ratio average, and an n-best stream likelihood dispersion measure. The latter results in superior audio-visual phonetic classification, as indicated by our experiments on a 260-subject, 40-hour-long, large-vocabulary, continuous-speech audio-visual dataset. By using local, dispersion-based stream exponents, we obtain a further 20% improvement in phone classification accuracy beyond the improvement that global stream exponents provide over clean audio-only phonetic classification. The performance of the algorithm, however, still falls significantly short of an “oracle” (cheating) confidence estimation scheme.
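
The abstract does not spell out the combination rule or the exact confidence definitions, so the following Python sketch is only an illustration of one plausible reading: per-frame class log-likelihoods from the audio and visual GMM streams are combined with exponents looked up from a two-dimensional piecewise-constant table indexed by the quantized stream confidences. The entropy, n-best likelihood-ratio-average, and n-best dispersion functions below are assumed forms, not the paper's definitions, and names such as lookup_exponents and exponent_table are purely illustrative.

```python
import numpy as np

def combined_log_likelihood(logp_audio, logp_visual, lam_audio, lam_visual):
    """Exponent-weighted two-stream score for one frame.

    logp_audio, logp_visual: per-class log-likelihoods log P_A(o_A|c), log P_V(o_V|c).
    lam_audio, lam_visual: stream exponents (weights) for this frame.
    """
    return lam_audio * logp_audio + lam_visual * logp_visual

def entropy_confidence(logp):
    """Stream entropy (assumed form): entropy of the class posteriors obtained by
    normalizing the stream likelihoods under a uniform class prior.
    Lower entropy suggests higher stream confidence."""
    post = np.exp(logp - logp.max())
    post /= post.sum()
    return -np.sum(post * np.log(post + 1e-12))

def nbest_likelihood_ratio_average(logp, n=5):
    """Assumed form: average log-likelihood ratio between the best class
    and each of the remaining n-best classes."""
    top = np.sort(logp)[::-1][:n]
    return np.mean(top[0] - top[1:])

def nbest_dispersion(logp, n=5):
    """Assumed form: average pairwise log-likelihood difference among
    the n-best classes."""
    top = np.sort(logp)[::-1][:n]
    diffs = [top[i] - top[j] for i in range(len(top)) for j in range(i + 1, len(top))]
    return np.mean(diffs)

def lookup_exponents(conf_a, conf_v, bin_edges_a, bin_edges_v, exponent_table):
    """Piecewise-constant exponents (illustrative): pick (lam_A, lam_V) from a
    2-D table indexed by the quantized audio and visual confidences.
    exponent_table has shape (len(bin_edges_a)+1, len(bin_edges_v)+1, 2)."""
    i = np.searchsorted(bin_edges_a, conf_a)
    j = np.searchsorted(bin_edges_v, conf_v)
    return exponent_table[i, j]
```

In such a scheme, the entries of exponent_table would be the quantities estimated on the held-out set by minimizing misclassification error, while the confidence functions only decide which table cell applies to a given frame.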
