Speech Enhancement with Zero-Shot Model Selection

Recent research on speech enhancement (SE) has seen the emergence of deep learning-based methods. It is still a challenging task to determine effective ways to increase the generalizability of SE under diverse test conditions. In this paper, we combine zero-shot learning and ensemble learning to propose a zero-shot model selection (ZMOS) approach to increase the generalization of SE performance. The proposed approach is realized in two phases, namely offline and online phases. The offline phase clusters the entire set of training data into multiple subsets, and trains a specialized SE model (termed component SE model) with each subset. The online phase selects the most suitable component SE model to carry out enhancement. Two selection strategies are developed: selection based on quality score (QS) and selection based on quality embedding (QE). Both QS and QE are obtained by a Quality-Net, a non-intrusive quality assessment network. In the offline phase, the QS or QE of a train-ing utterance is used to group the training data into clusters. In the online phase, the QS or QE of the test utterance is used to identify the appropriate component SE model to perform enhancement on the test utterance. Experimental results have confirmed that the proposed ZMOS approach can achieve better performance in both seen and unseen noise types compared to the baseline systems, which indicates the effectiveness of the proposed approach to provide robust SE performance.

[1]  Paris Smaragdis,et al.  Experiments on deep learning for speech denoising , 2014, INTERSPEECH.

[2]  Yu Tsao,et al.  Speech enhancement based on deep denoising autoencoder , 2013, INTERSPEECH.

[3]  Rainer Stiefelhagen,et al.  Recovering the Missing Link: Predicting Class-Attribute Associations for Unsupervised Zero-Shot Learning , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Ryandhimas E. Zezario,et al.  Specialized Speech Enhancement Model Selection Based on Learned Non-Intrusive Quality Assessment Metric , 2019, INTERSPEECH.

[5]  R. Martin,et al.  New speech enhancement techniques for low bit rate speech coding , 1999, 1999 IEEE Workshop on Speech Coding Proceedings. Model, Coders, and Error Criteria (Cat. No.99EX351).

[6]  Hao Tang,et al.  VoiceID Loss: Speech Enhancement for Speaker Verification , 2019, INTERSPEECH.

[7]  Zheng-Hua Tan,et al.  Speech enhancement using Long Short-Term Memory based recurrent Neural Networks for noise robust Speaker Verification , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[8]  Tim Fingscheidt,et al.  Convolutional Neural Networks to Enhance Coded Speech , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[9]  Jun Du,et al.  An Experimental Study on Speech Enhancement Based on Deep Neural Networks , 2014, IEEE Signal Processing Letters.

[10]  Jun Du,et al.  Multiple-target deep learning for LSTM-RNN based speech enhancement , 2017, 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA).

[11]  John R. Hershey,et al.  Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks , 2015, INTERSPEECH.

[12]  Yifan Gong,et al.  Robust automatic speech recognition : a bridge to practical application , 2015 .

[13]  Yu Tsao,et al.  SNR-Aware Convolutional Neural Network Modeling for Speech Enhancement , 2016, INTERSPEECH.

[14]  Jesper Jensen,et al.  An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[15]  Yi Hu,et al.  Evaluation of Noise Reduction Methods for Sentence Recognition by Mandarin-Speaking Cochlear Implant Listeners , 2015, Ear and hearing.

[16]  Silvio Savarese,et al.  Recognizing human actions by attributes , 2011, CVPR 2011.

[17]  DeLiang Wang,et al.  A New Framework for CNN-Based Speech Enhancement in the Time Domain , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[18]  Christoph H. Lampert,et al.  Attribute-Based Classification for Zero-Shot Visual Object Categorization , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Florian Metze,et al.  Zero-shot Learning for Speech Recognition with Universal Phonetic Model , 2018 .

[20]  Janet M. Baker,et al.  The Design for the Wall Street Journal-based CSR Corpus , 1992, HLT.

[21]  Zhong-Qiu Wang,et al.  A Joint Training Framework for Robust Automatic Speech Recognition , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[22]  Chandan K. A. Reddy,et al.  Smartphone based real-time super Gaussian single microphone Speech Enhancement to improve intelligibility for hearing aid users using formant information , 2018, 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC).

[23]  Yu Tsao,et al.  Quality-Net: An End-to-End Non-intrusive Speech Quality Assessment Model based on BLSTM , 2018, INTERSPEECH.

[24]  Yu Tsao,et al.  End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[25]  DeLiang Wang,et al.  Deep learning reinvents the hearing aid , 2017, IEEE Spectrum.

[26]  Christoph H. Lampert,et al.  Learning to detect unseen object classes by between-class attribute transfer , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Yifan Gong,et al.  An Overview of Noise-Robust Automatic Speech Recognition , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[28]  Xin Wang,et al.  Zero-Shot Multi-Speaker Text-To-Speech with State-Of-The-Art Neural Speaker Embeddings , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[29]  Panayiotis G. Georgiou,et al.  Perception Optimized Deep Denoising AutoEncoders for Speech Enhancement , 2016, INTERSPEECH.

[30]  Philipos C. Loizou,et al.  Speech Enhancement: Theory and Practice , 2007 .

[31]  Kristen Grauman,et al.  Zero-shot recognition with unreliable attributes , 2014, NIPS.

[32]  Thomas Brox,et al.  Lucid Data Dreaming for Object Tracking , 2017, ArXiv.

[33]  Andries P. Hekstra,et al.  Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).