Adaptive Reliability Measure and Optimum Integration Weight for Decision Fusion Audio-visual Speech Recognition

Audio-visual speech recognition (AVSR), which exploits both the acoustic and visual signals of speech, has received attention recently because of its robustness in noisy environments. An important issue in decision fusion-based AVSR systems is determining an appropriate integration weight for the two speech modalities so that their combination performs well under various SNR conditions. Generally, the integration weight is calculated from the relative reliability of the two modalities. This paper investigates the effect of the reliability measure on integration weight estimation and proposes a genetic algorithm (GA)-based reliability measure that uses an optimal number of best recognition hypotheses, rather than a fixed N best hypotheses, to determine an appropriate integration weight. A further improvement in recognition accuracy is achieved by optimizing the resulting integration weight with the genetic algorithm. The performance of the proposed integration weight estimation scheme is demonstrated on an isolated-word recognition task (covering commonly used mobile-phone functions) through multi-speaker database experiments. The results show that, under various SNR conditions, the proposed schemes improve recognition accuracy over the conventional unimodal systems and over two related bimodal systems: the baseline reliability-ratio-based system and the N-best-recognition-hypotheses reliability-ratio-based system.
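To make the decision-fusion idea concrete, the following is a minimal sketch of the conventional reliability-ratio scheme the abstract takes as its baseline: each stream's reliability is measured as the average log-likelihood margin of its top hypothesis over the next best hypotheses, the integration weight is the ratio of the acoustic reliability to the total, and the streams' per-word scores are combined linearly. All function names are illustrative, and the exact reliability formula and GA-based optimization of the paper are not reproduced here.

```python
import numpy as np

def reliability(log_likelihoods, n_best):
    """Dispersion-style reliability: average margin of the top
    hypothesis' log-likelihood over the next (n_best - 1) ranked
    hypotheses. Larger margin -> more confident stream."""
    ranked = np.sort(np.asarray(log_likelihoods))[::-1][:n_best]
    return np.mean(ranked[0] - ranked[1:])

def integration_weight(s_audio, s_visual):
    """Reliability-ratio rule: weight assigned to the acoustic stream."""
    return s_audio / (s_audio + s_visual)

def fused_score(logp_audio, logp_visual, gamma):
    """Late (decision-level) fusion of the two streams' word scores."""
    return gamma * logp_audio + (1.0 - gamma) * logp_visual
```

In clean speech the acoustic margins are large, so `integration_weight` approaches 1 and the fused score follows the acoustic recognizer; as noise shrinks the acoustic margins, the weight shifts toward the visual stream. The paper's contribution is to choose the number of hypotheses entering `reliability` (here the fixed `n_best` argument) and to refine the resulting weight via a genetic algorithm, rather than fixing both by hand.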
