Bimodal Speech Recognition Fusing Audio-Visual Modalities

In this paper, we present a novel bimodal speech recognition technique that fuses audio information (the sound signal) and visual information (lip movements) for Russian speech recognition. We propose an architecture for an automatic bimodal audio-visual speech recognition system that captures the audio and video signals with one stationary Oktava microphone and one JAI Pulnix high-speed camera (200 frames per second at 640 × 480 pixels). We also describe the developed software for audio-visual speech database recording, the phonemic and visemic structures of the Russian language, and probabilistic models of bimodal speech units based on Coupled Hidden Markov Models. A method for transforming a Coupled Hidden Markov Model into an equivalent 2-stream Hidden Markov Model is presented as well.
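In the 2-stream HMM formulation mentioned above, the audio and video observation streams are typically combined at the state level by weighting their per-stream emission log-likelihoods. The following is a minimal sketch of that fusion step, assuming univariate Gaussian emissions and a single stream weight λ; the function names and the example means/variances are illustrative, not taken from the paper.

```python
import math

def gauss_logpdf(x, mean, var):
    # Log-density of a univariate Gaussian N(mean, var) at x.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def fused_log_likelihood(audio_ll, video_ll, lam=0.7):
    # Stream-weighted fusion for a 2-stream HMM state:
    #   log b(o) = lam * log b_audio(o_a) + (1 - lam) * log b_video(o_v)
    # lam close to 1 trusts audio more; lower lam favors the visual stream
    # (e.g., in acoustically noisy conditions).
    return lam * audio_ll + (1 - lam) * video_ll

# Illustrative observations scored against hypothetical state emission models.
audio_ll = gauss_logpdf(0.2, 0.0, 1.0)   # audio feature vs. audio-stream Gaussian
video_ll = gauss_logpdf(1.5, 1.0, 2.0)   # visual feature vs. video-stream Gaussian
score = fused_log_likelihood(audio_ll, video_ll, lam=0.7)
```

In practice the weight λ can be tuned per viseme class or per noise condition rather than fixed globally; the cited approach optimizes viseme-dependent weights.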
