RS-CAE-Based AR-Wiener Filtering and Harmonic Recovery for Speech Enhancement

This work exploits the temporal correlation of speech features. In this paper, a novel convolutional auto-encoder (CAE) structure is proposed in which the historical output of the CAE is fed back into a CAE stack recurrently; we name this structure the Recurrent Stack Convolutional Auto-Encoder (RS-CAE). In the training stage, the input feature maps of the RS-CAE comprise the log power spectrum (LPS) of noisy speech and an additional feature map derived from the LPS of previously enhanced speech, so that temporal correlation is incorporated as much as possible. The training target is a concatenated vector of the auto-regressive (AR) model parameters of speech and noise. In the online stage, the LPS of noisy speech and the LPS of previously enhanced speech together form the input feature maps, and the RS-CAE outputs the AR model parameters of speech and noise, which are then used to construct an AR-Wiener filter. Because the estimated AR parameters are not perfectly accurate and some harmonics may be lost in the enhanced speech, a codebook-based harmonic recovery technique is proposed to reconstruct the harmonic structure of the enhanced speech. Test results confirm that the proposed method outperforms several existing approaches.
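As a minimal sketch of the filtering step described above: given AR coefficients and excitation gains for speech and noise, the AR model induces a parametric power spectral density for each signal, and the Wiener gain at every frequency bin is the ratio of the speech PSD to the sum of the two PSDs. The function names and the toy first-order parameters below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def ar_psd(a, g, n_fft=512):
    """PSD of an AR process: P(w) = g^2 / |1 - sum_k a[k] e^{-jw(k+1)}|^2."""
    # Denominator polynomial A(z) = 1 - a_1 z^-1 - ... - a_p z^-p,
    # evaluated on the unit circle via a zero-padded FFT.
    A = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    spectrum = np.fft.rfft(A, n_fft)
    return (g ** 2) / np.abs(spectrum) ** 2

def ar_wiener_gain(a_s, g_s, a_n, g_n, n_fft=512):
    """Frequency-domain Wiener gain built from speech and noise AR parameters."""
    p_speech = ar_psd(a_s, g_s, n_fft)
    p_noise = ar_psd(a_n, g_n, n_fft)
    return p_speech / (p_speech + p_noise)

# Toy example: a first-order low-pass "speech" model against white noise.
gain = ar_wiener_gain(a_s=[0.9], g_s=1.0, a_n=[], g_n=0.5)
```

The resulting `gain` lies strictly between 0 and 1 and is largest where the speech AR spectrum dominates (near DC for this low-pass toy model); it would be applied to the noisy spectrum bin-by-bin before resynthesis.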
