Using a High-Speed Video Camera for Robust Audio-Visual Speech Recognition in Acoustically Noisy Conditions

The purpose of this study is to develop a robust audio-visual speech recognition system and to investigate the influence of high-speed video data on the recognition accuracy of continuous Russian speech under different noise conditions. The experimental setup we developed and the multimodal database we collected allow us to explore the impact of high-speed video recordings at various frame rates, from the standard 25 frames per second (fps) up to 200 fps. To date, no research has objectively characterized the dependence of speech recognition accuracy on the video frame rate, and no relevant audio-visual databases exist for model training. In this paper, we try to fill this gap for continuous Russian speech. Our evaluation experiments show an absolute increase in recognition accuracy of up to 3% and demonstrate that using the JAI Pulnix high-speed camera at 200 fps yields better recognition results under different acoustic noise conditions.
