Paper information - DeepLPC: A Deep Learning Approach to Augmented Kalman Filter-Based Single-Channel Speech Enhancement

Current deep learning approaches to linear prediction coefficient (LPC) estimation for the augmented Kalman filter (AKF) produce biased estimates, due to the use of a whitening filter. This bias severely degrades the perceived quality and intelligibility of the enhanced speech produced by the AKF. In this paper, we propose a deep learning framework that produces clean speech and noise LPC estimates with significantly less bias than previous methods by avoiding the use of a whitening filter. The proposed framework, called DeepLPC, jointly estimates the clean speech and noise LPC power spectra. The inverse Fourier transform of each estimated LPC power spectrum gives the autocorrelation matrix, which is then solved by the Levinson-Durbin recursion to obtain the LPCs and prediction error variances of the speech and noise for the AKF. The performance of DeepLPC is evaluated on the NOIZEUS and DEMAND Voice Bank datasets using subjective AB listening tests, as well as seven objective measures (CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR). DeepLPC is compared to six existing deep learning-based methods. DeepLPC produces a lower spectral distortion (SD) level than other deep learning approaches to clean speech LPC estimation, confirming that it exhibits less bias. DeepLPC also produced higher objective scores than any of the competing methods (with an improvement of 0.11 for CSIG, 0.15 for CBAK, 0.14 for COVL, 0.13 for PESQ, 2.66% for STOI, 1.11 dB for SegSNR, and 1.05 dB for SI-SDR over the next best method). The enhanced speech produced by DeepLPC was also the most preferred by 10 listeners. By producing less biased clean speech and noise LPC estimates, DeepLPC enables the AKF to produce enhanced speech with higher quality and intelligibility.
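
The spectrum-to-LPC step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `levinson_durbin` and `lpc_from_power_spectrum`, the one-sided spectrum layout, and the use of NumPy are assumptions made for illustration, and a synthetic autoregressive (AR) spectrum stands in for the output of the network. The sketch takes an LPC power spectrum, applies the inverse Fourier transform to obtain autocorrelation lags, and runs the Levinson-Durbin recursion to recover the LPCs and the prediction error variance.

```python
import numpy as np


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations from autocorrelation lags r[0..order].

    Returns (a, err), where a[0] == 1 and the prediction error filter is
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order, and err is the
    prediction error variance.
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = float(r[0])
    for i in range(1, order + 1):
        # acc = r[i] + sum_{j=1}^{i-1} a[j] * r[i-j]
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err  # reflection coefficient of stage i
        prev = a.copy()
        # Order update: a[j] <- a[j] + k * a[i-j] for j = 1..i-1
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def lpc_from_power_spectrum(psd, order):
    """Map a one-sided power spectrum (n_fft/2 + 1 bins) to LPCs.

    Assumption: psd is sampled on the rfft frequency grid, so its inverse
    real FFT gives the autocorrelation sequence.
    """
    n_fft = 2 * (len(psd) - 1)
    autocorr = np.fft.irfft(psd, n=n_fft)
    return levinson_durbin(autocorr[:order + 1], order)


# Synthetic stand-in for an estimated LPC power spectrum:
# P(w) = sigma^2 / |A(e^{jw})|^2 for a known, stable AR(2) model.
a_true = np.array([1.0, -0.9, 0.2])
sigma2_true = 0.5
n_fft = 512
psd = sigma2_true / np.abs(np.fft.rfft(a_true, n=n_fft)) ** 2

a_est, sigma2_est = lpc_from_power_spectrum(psd, order=2)
print("LPCs:", a_est)
print("prediction error variance:", sigma2_est)
```

In this sketch the same function would be called twice per frame, once on the clean speech spectrum and once on the noise spectrum, to supply both sets of parameters to the AKF; the LPC orders used for each are not stated in the abstract.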
