Paper information - On the Importance of Super-Gaussian Speech Priors for Machine-Learning Based Speech Enhancement

To enhance noisy signals, machine-learning based single-channel speech enhancement schemes exploit prior knowledge about typical speech spectral structures. To ensure good generalization and to meet requirements on computational complexity and memory consumption, certain methods restrict themselves to learning speech spectral envelopes. We refer to these approaches as machine-learning spectral envelope (MLSE)-based approaches. In this paper, we show by means of theoretical and experimental analyses that, for MLSE-based approaches, super-Gaussian priors allow for a reduction of noise between speech spectral harmonics that is not achievable using Gaussian estimators such as the Wiener filter. For the evaluation, we use a deep neural network based phoneme classifier and a low-rank nonnegative matrix factorization framework as examples of MLSE-based approaches. A listening experiment and instrumental measures confirm that, while super-Gaussian priors yield only moderate improvements for classic enhancement schemes, for MLSE-based approaches they make an important difference and significantly outperform Gaussian priors.

Robert Rehr | Timo Gerkmann
