Audio-Visual Speech Enhancement using Hierarchical Extreme Learning Machine

Recently, the hierarchical extreme learning machine (HELM) model has been utilized for speech enhancement (SE) and has demonstrated promising performance, especially when the amount of training data is limited and the system does not support heavy computations. Building on the success of audio-only HELM-based systems, termed AHELM, we propose a novel audio-visual HELM-based SE system, termed AVHELM, that integrates audio and visual information to address the problem of unseen nonstationary noise at low SNR levels and thereby attain improved SE performance. The experimental results demonstrate that AVHELM yields satisfactory enhancement performance with a limited amount of training data and outperforms AHELM in terms of three standardized objective measures under both matched and mismatched testing conditions, confirming the effectiveness of incorporating visual information into the HELM-based SE system.
