Audio-visual feature fusion via deep neural networks for automatic speech recognition

Abstract The brain-like functionality of artificial neural networks, together with their strong performance across many areas of scientific application, makes them a reliable tool for Audio-Visual Speech Recognition (AVSR) systems. Such networks are applied throughout AVSR systems, from the preliminary stage of feature extraction to the higher levels of information combination and speech modeling. In this paper, carefully designed deep autoencoders are proposed to produce efficient bimodal features from the audio and visual input streams. The basic proposed structure is modified in three successive steps to make better use of the visual information from the speakers' lip Region of Interest (ROI). The performance of the proposed structures is compared to both unimodal and bimodal baselines on a phoneme recognition task under different noisy audio conditions, using a state-of-the-art DNN-HMM hybrid as the speech classifier. Compared to MFCC audio-only features, the finally proposed bimodal features yield an average relative reduction of 36.9% in Phoneme Error Rate (PER) across a range of noisy conditions, and a relative reduction of 19.2% in the clean condition.
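The core fusion idea, concatenating the audio and visual feature streams and compressing them through a shared bottleneck whose activations serve as the bimodal feature, can be sketched as a minimal autoencoder forward pass. This is an illustrative numpy sketch only: the dimensions, single-layer depth, and random initialization are assumptions for clarity, not the paper's actual architecture or training procedure.

```python
import numpy as np

# Hypothetical dimensions -- the paper's actual feature sizes are not given here.
AUDIO_DIM, VISUAL_DIM, BOTTLENECK_DIM = 39, 32, 40  # e.g. MFCCs, lip-ROI features

rng = np.random.default_rng(0)

def layer(in_dim, out_dim):
    """Randomly initialised affine layer (weights, bias)."""
    return rng.standard_normal((in_dim, out_dim)) * 0.1, np.zeros(out_dim)

# Encoder: concatenate the two modalities, then compress to a shared bottleneck.
enc_w, enc_b = layer(AUDIO_DIM + VISUAL_DIM, BOTTLENECK_DIM)
# Decoder: reconstruct both streams from the shared bottleneck.
dec_w, dec_b = layer(BOTTLENECK_DIM, AUDIO_DIM + VISUAL_DIM)

def forward(audio, visual):
    x = np.concatenate([audio, visual], axis=-1)
    h = np.tanh(x @ enc_w + enc_b)   # bimodal bottleneck feature
    x_hat = h @ dec_w + dec_b        # reconstruction of both modalities
    return h, x_hat

audio = rng.standard_normal((8, AUDIO_DIM))    # batch of 8 audio frames
visual = rng.standard_normal((8, VISUAL_DIM))  # the matching lip-ROI frames
h, x_hat = forward(audio, visual)
print(h.shape, x_hat.shape)
```

In an AVSR pipeline of the kind the abstract describes, the bottleneck activations `h` (not the reconstruction) would be passed on as input features to the DNN-HMM phoneme classifier; training would minimize the reconstruction error between `x_hat` and the concatenated input.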
