Audio-Visual Speech Enhancement based on Multimodal Deep Convolutional Neural Network

Speech enhancement (SE) aims to reduce noise in speech signals. Most SE techniques use audio information only. In this work, inspired by multimodal learning, which utilizes data from different modalities, and by the recent success of convolutional neural networks (CNNs) in SE, we propose an audio-visual deep CNN (AVDCNN) SE model that incorporates audio and visual streams into a unified network. We also propose a multi-task learning framework for reconstructing audio and visual signals at the output layer. Specifically, the proposed AVDCNN model is structured as an audio-visual encoder-decoder network in which audio and visual data are first processed by individual CNNs and then fused in a joint network that generates enhanced speech (the primary task) and reconstructed images (the secondary task) at the output layer. The model is trained in an end-to-end manner, and its parameters are jointly learned through back-propagation. We evaluate the enhanced speech using five instrumental criteria. Results show that the AVDCNN model yields notably better performance than an audio-only CNN-based SE model and two conventional SE approaches, confirming the effectiveness of integrating visual information into the SE process. In addition, the AVDCNN model outperforms an existing audio-visual SE model, confirming its capability to combine audio and visual information effectively in SE.
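The multi-task objective described above can be illustrated with a minimal sketch. It assumes that both output branches are scored with mean squared error and combined as a weighted sum; the abstract does not state the loss function or any weighting, so the names mse, multitask_loss, and visual_weight are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multitask_loss(
    audio_pred: Sequence[float],
    audio_clean: Sequence[float],
    visual_pred: Sequence[float],
    visual_target: Sequence[float],
    visual_weight: float = 0.5,  # assumed value, not taken from the paper
) -> float:
    """Weighted sum of the primary (audio) and secondary (visual) errors.

    Assumption: both tasks use MSE and the secondary task is scaled by a
    single scalar weight. The paper's actual objective may differ.
    """
    primary = mse(audio_pred, audio_clean)        # enhanced speech vs. clean speech
    secondary = mse(visual_pred, visual_target)   # reconstructed image vs. input image
    return primary + visual_weight * secondary


if __name__ == "__main__":
    # Toy feature vectors standing in for one audio frame and one image patch.
    loss = multitask_loss([1.0, 2.0], [1.0, 4.0], [0.0], [2.0], visual_weight=0.5)
    print(loss)
```

In a setup like this, minimizing one scalar loss is what lets the gradients from both outputs flow back through the fused layers into the audio and visual CNNs, which matches the joint, end-to-end training the abstract describes.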
