Speech Enhancement Using a Two-Stage Network for an Efficient Boosting Strategy

A novel neural network architecture, called the two-stage network (TSN), with a multi-objective learning (MOL) method for an efficient boosting strategy (BS), is proposed for speech enhancement. BS is an ensemble method that combines multiple base predictions (MBPs) into a better final prediction. Because MBPs are required, the computational cost and model size of BS-based methods are greater than those of a single model. To overcome this, the TSN first obtains MBPs from a single deep neural network. Then, to obtain a better final prediction, the convolution layers of the TSN aggregate not only the MBPs but also auxiliary information, such as contextual information, while adaptively filtering out unnecessary information, e.g., poor base predictions. In the training phase, MOL enables all stages of the TSN to learn jointly, while allowing the TSN framework to embed a BS. Our experimental results confirm that the embedded BS enables the TSN to outperform other baseline methods at a reasonably low computational cost and model size.
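The two-stage flow described above can be illustrated with a minimal NumPy sketch. This is a hypothetical toy forward pass, not the paper's implementation: all layer sizes, the number of base predictions K, and the random weights are assumptions for illustration. Stage 1 produces K base predictions from a single shared network; stage 2 aggregates the stacked predictions into one final estimate, standing in for the paper's convolutional aggregation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 64-dim spectral
# features, K = 4 base predictions from a single stage-1 network.
D, K, H = 64, 4, 128

# Stage 1: one shared hidden layer with K output heads, so the K base
# predictions come from a single network rather than K separate models.
W_h = rng.standard_normal((D, H)) * 0.05
W_out = rng.standard_normal((K, H, D)) * 0.05

def stage1(x):
    h = np.maximum(x @ W_h, 0.0)                       # shared ReLU hidden layer
    return np.stack([h @ W_out[k] for k in range(K)])  # (K, D) base predictions

# Stage 2: a learned mixing of the K stacked predictions; in the actual
# TSN this role is played by convolution layers whose input-dependent
# responses can down-weight poor base predictions.
W_agg = rng.standard_normal(K) / K

def stage2(mbp):
    # Weighted combination of the K base predictions -> final estimate.
    return np.tensordot(W_agg, mbp, axes=1)            # (D,) final prediction

x = rng.standard_normal(D)          # one noisy feature frame
final = stage2(stage1(x))
print(final.shape)
```

Under MOL, both stages would be trained jointly, with losses attached to the base predictions and to the final prediction, rather than training the aggregator on frozen stage-1 outputs.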
