Speech Recognition by Integrating Audio, Visual and Contextual Features Based on Neural Networks

Recent research has focused on the fusion of audio and visual features for reliable speech recognition in noisy environments. In this paper, we propose a neural-network-based model for robust speech recognition that integrates audio, visual, and contextual information. The Bimodal Neural Network (BMNN) is a four-layer multi-layer perceptron that combines audio and visual features of speech to compensate for the loss of audio information caused by noise. To further improve recognition accuracy in noisy environments, we also propose a post-processing step based on contextual information, namely the sequential patterns of words spoken by a user. Our experimental results show that the model outperforms every single-modality model. In particular, when contextual information is used, we obtain over 90% recognition accuracy even in noisy environments, which is a significant improvement over the state of the art in speech recognition.
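The two ideas above, bimodal fusion in a four-layer perceptron and context-based post-processing, can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper. It assumes that the audio and visual feature vectors are fused by concatenation at the input layer, that "four layers" means an input layer, two hidden layers, and an output layer, and that the contextual information can be approximated by add-one-smoothed word-bigram counts. All function names, layer sizes, weights, and counts are hypothetical.

```python
import math
import random


def softmax(scores):
    """Turn arbitrary scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights, biases, activation=None):
    """One fully connected layer: activation(W x + b)."""
    out = [sum(w * v for w, v in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [activation(o) for o in out] if activation else out


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (hypothetical, untrained)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def bimodal_forward(audio, visual, layers):
    """Forward pass: input -> hidden -> hidden -> output (word posteriors)."""
    # Assumption: the two modalities are fused by simple concatenation.
    x = list(audio) + list(visual)
    for weights, biases in layers[:-1]:
        x = dense(x, weights, biases, math.tanh)
    weights, biases = layers[-1]
    return softmax(dense(x, weights, biases))


def rescore_with_context(posteriors, prev_word, vocab, bigram_counts,
                         weight=1.0):
    """Combine network posteriors with P(word | prev_word) in the log domain."""
    # Assumption: context is modelled by add-one-smoothed bigram counts.
    total = sum(bigram_counts.get((prev_word, w), 0) for w in vocab) + len(vocab)
    scores = []
    for p, w in zip(posteriors, vocab):
        prior = (bigram_counts.get((prev_word, w), 0) + 1) / total
        scores.append(math.log(p + 1e-12) + weight * math.log(prior))
    return softmax(scores)


rng = random.Random(0)
vocab = ["call", "home", "stop"]           # hypothetical vocabulary
sizes = [4 + 3, 8, 6, len(vocab)]          # 4 audio + 3 visual input features
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

posteriors = bimodal_forward([0.1, 0.4, -0.2, 0.3], [0.5, -0.1, 0.2], layers)
bigrams = {("call", "home"): 9, ("call", "stop"): 1}   # hypothetical counts
rescored = rescore_with_context(posteriors, "call", vocab, bigrams)
```

In this sketch the `weight` parameter controls how strongly the word history influences the final decision; with `weight=0.0` the rescored distribution reduces to the network posteriors. Treating the context as a bigram prior is only one possible reading of "sequential patterns of words".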
