DNN-Supported Speech Enhancement With Cepstral Estimation of Both Excitation and Envelope

In this paper, we propose and compare various techniques for the estimation of clean spectral envelopes in noisy conditions. The source-filter model of human speech production is employed in combination with a hidden Markov model and/or a deep neural network approach to estimate clean envelope-representing coefficients in the cepstral domain. The cepstral estimators for speech spectral envelope-based noise reduction are both evaluated alone and also in combination with the recently introduced cepstral excitation manipulation (CEM) technique for a priori SNR estimation in a noise reduction framework. Relative to the classical MMSE short time spectral amplitude estimator, we obtain more than 2 dB higher noise attenuation, and relative to our recent CEM technique still 0.5 dB more, in both cases maintaining the quality of the speech component and obtaining considerable SNR improvement.

[1]  W. Bastiaan Kleijn,et al.  HMM-Based Gain Modeling for Enhancement of Speech in Noise , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[2]  Li-Rong Dai,et al.  A Regression Approach to Speech Enhancement Based on Deep Neural Networks , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[3]  Panos E. Papamichalis,et al.  Practical approaches to speech coding , 1987 .

[4]  Tomohiro Nakatani,et al.  Speech enhancement based on log spectral envelope model and harmonicity-derived spectral mask, and its coupling with feature compensation , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Wouter Tirry,et al.  Two-stage speech enhancement with manipulation of the cepstral excitation , 2017, 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA).

[6]  Wouter Tirry,et al.  Instantaneous A Priori SNR Estimation by Cepstral Excitation Manipulation , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[7]  Rainer Martin,et al.  Noise power spectral density estimation based on optimal smoothing and minimum statistics , 2001, IEEE Trans. Speech Audio Process..

[8]  Geoffrey J. McLachlan,et al.  Finite Mixture Models , 2019, Annual Review of Statistics and Its Application.

[9]  Israel Cohen,et al.  Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging , 2003, IEEE Trans. Speech Audio Process..

[10]  Rainer Martin,et al.  Cepstral Smoothing of Spectral Filter Gains for Speech Enhancement Without Musical Noise , 2007, IEEE Signal Processing Letters.

[11]  Rainer Martin,et al.  On the Statistics of Spectral Amplitudes After Variance Reduction by Temporal Cepstrum Smoothing and Cepstral Nulling , 2009, IEEE Transactions on Signal Processing.

[12]  Pascal Scalart,et al.  Speech enhancement based on a priori signal to noise estimation , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[13]  Sridha Sridharan,et al.  The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms , 2010, INTERSPEECH.

[14]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[15]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .

[16]  DeLiang Wang,et al.  On Training Targets for Supervised Speech Separation , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[17]  Tim Fingscheidt,et al.  Environment-Optimized Speech Enhancement , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[18]  Jonathan G. Fiscus,et al.  Darpa Timit Acoustic-Phonetic Continuous Speech Corpus CD-ROM {TIMIT} | NIST , 1993 .

[19]  Jesper Jensen,et al.  An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[20]  Peter Vary,et al.  Speech Enhancement by MAP Spectral Amplitude Estimation Using a Super-Gaussian Speech Model , 2005, EURASIP J. Adv. Signal Process..

[21]  Seyedmahdad Mirsamadi,et al.  Causal Speech Enhancement Combining Data-Driven Learning and Suppression Rule Estimation , 2016, INTERSPEECH.

[22]  W. Bastiaan Kleijn,et al.  Codebook-Based Bayesian Speech Enhancement for Nonstationary Environments , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[23]  Tim Fingscheidt,et al.  A DNN regression approach to speech enhancement by artificial bandwidth extension , 2017, 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[24]  Tim Fingscheidt,et al.  Artificial bandwidth extension using deep neural networks for spectral envelope estimation , 2016, 2016 IEEE International Workshop on Acoustic Signal Enhancement (IWAENC).

[25]  Tim Fingscheidt,et al.  MMSE speech enhancement under speech presence uncertainty assuming (generalized) gamma speech priors throughout , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[26]  Rainer Martin,et al.  A novel a priori SNR estimation approach based on selective cepstro-temporal smoothing , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[27]  Clément Farabet,et al.  Torch7: A Matlab-like Environment for Machine Learning , 2011, NIPS 2011.

[28]  Jonathan Le Roux,et al.  Non-negative source-filter dynamical system for speech enhancement , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[29]  W. Bastiaan Kleijn,et al.  Speech enhancement using a-priori information , 2003, INTERSPEECH.

[30]  Changchun Bao,et al.  Speech enhancement based on AR model parameters estimation , 2016, Speech Commun..

[31]  Robert Rehr,et al.  A Combination of Pre-Trained Approaches and Generic Methods for an Improved Speech Enhancement , 2016, ITG Symposium on Speech Communication.

[32]  Yariv Ephraim,et al.  A Bayesian estimation approach for speech enhancement using hidden Markov models , 1992, IEEE Trans. Signal Process..

[33]  Robert M. Gray,et al.  An Algorithm for Vector Quantizer Design , 1980, IEEE Trans. Commun..

[34]  A. Noll Cepstrum pitch determination. , 1967, The Journal of the Acoustical Society of America.

[35]  Christophe Beaugeant,et al.  Overcoming the statistical independence assumption w.r.t. frequency in speech enhancement , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[36]  Tim Fingscheidt,et al.  Artificial Speech Bandwidth Extension Using Deep Neural Networks for Wideband Spectral Envelope Estimation , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[37]  Ephraim Speech enhancement using a minimum mean square error short-time spectral amplitude estimator , 1984 .

[38]  DeLiang Wang,et al.  A deep neural network for time-domain signal reconstruction , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[39]  Yoshua Bengio,et al.  Learning long-term dependencies with gradient descent is difficult , 1994, IEEE Trans. Neural Networks.

[40]  Richard C. Hendriks,et al.  Improved mmse-based noise PSD tracking using temporal cepstrum smoothing , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[41]  Jesper Jensen,et al.  MMSE based noise PSD tracking with low complexity , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[42]  David Talkin,et al.  A Robust Algorithm for Pitch Tracking ( RAPT ) , 2005 .

[43]  Richard C. Hendriks,et al.  Unbiased MMSE-Based Noise Power Estimation With Low Complexity and Low Tracking Delay , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[44]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[45]  W. Bastiaan Kleijn,et al.  Codebook driven short-term predictor parameter estimation for speech enhancement , 2006, IEEE Transactions on Audio, Speech, and Language Processing.