Optimum integration weight for decision fusion audio-visual speech recognition

Automatic speech recognition (ASR) technologies have been successfully applied to several real-world applications. However, several problems must still be solved before these technologies can be applied more widely. One of them is the noise robustness of recognition performance. Recently, audio-visual speech recognition (AVSR) has received attention as a solution to this problem: visual speech information is used together with the acoustic signal to recognise speech in noisy environments. This paper presents a new decision fusion AVSR system in which the integration weight used to fuse the classifiers' decisions is optimised using a genetic algorithm (GA). The optimally fused decision fusion AVSR system thereby produces robust recognition accuracy across all SNR conditions. To evaluate the performance of the proposed scheme, the recognition results are compared with those of an equal-weight bimodal AVSR system and with those of another state-of-the-art method for speech recognition in noise, namely the compression and Mel sub-band spectral subtraction (CMSBS)-based noise compensation method. Further, to show the effectiveness of the proposed optimisation method, the recognition results are compared with those of a similar method, the directed grid search method, which also optimises the integration weight against the recognition accuracy.
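
As an illustration of the idea described above, the following minimal sketch fuses per-class log-likelihoods from an audio and a visual classifier with a single integration weight and searches for that weight with a small genetic algorithm. It is not the system evaluated in the paper: the weighted-sum fusion rule, the function names (`fuse`, `accuracy`, `ga_optimise`), the GA settings and the toy scores are all assumptions made for illustration.

```python
import random


def fuse(audio_ll, visual_ll, lam):
    """Return the class index with the highest weighted log-likelihood.

    lam is the (assumed) integration weight: 1.0 trusts audio only,
    0.0 trusts the visual stream only.
    """
    scores = [lam * a + (1.0 - lam) * v for a, v in zip(audio_ll, visual_ll)]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(samples, lam):
    """Fraction of (audio_ll, visual_ll, label) samples fused correctly."""
    hits = sum(fuse(a, v, lam) == label for a, v, label in samples)
    return hits / len(samples)


def ga_optimise(samples, pop_size=20, generations=30, mutation_sd=0.1, seed=0):
    """Toy GA over lam in [0, 1] with recognition accuracy as fitness.

    Hypothetical settings: truncation selection keeps the better half,
    children are the mean of two parents plus Gaussian mutation.
    """
    rng = random.Random(seed)
    pop = [rng.random() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda lam: accuracy(samples, lam), reverse=True)
        parents = ranked[: pop_size // 2]  # elitism: best weights survive
        children = []
        while len(parents) + len(children) < pop_size:
            p, q = rng.sample(parents, 2)
            child = 0.5 * (p + q) + rng.gauss(0.0, mutation_sd)
            children.append(min(1.0, max(0.0, child)))  # clamp to [0, 1]
        pop = parents + children
    return max(pop, key=lambda lam: accuracy(samples, lam))


# Toy two-class scores (made up): in the first sample the audio stream is
# reliable, in the second it is not and the visual stream must compensate.
samples = [
    ([-1.0, -3.0], [-2.0, -1.9], 0),
    ([-2.0, -1.9], [-1.0, -3.0], 0),
]
best = ga_optimise(samples)
```

On this toy data neither stream alone classifies both samples correctly, whereas an intermediate weight does, which is the situation in which tuning the integration weight pays off. In a real system the fitness would be the recognition accuracy measured on held-out data at each SNR condition.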
