On the Importance of Super-Gaussian Speech Priors for Machine-Learning Based Speech Enhancement

To enhance noisy signals, machine-learning based single-channel speech enhancement schemes exploit prior knowledge about typical speech spectral structures. To ensure good generalization and to meet requirements on computational complexity and memory consumption, certain methods restrict themselves to learning speech spectral envelopes. We refer to these as machine-learning spectral envelope (MLSE)-based approaches. In this paper, we show by means of theoretical and experimental analyses that, for MLSE-based approaches, super-Gaussian priors allow for a reduction of noise between speech spectral harmonics that is not achievable using Gaussian estimators such as the Wiener filter. For the evaluation, we use a deep neural network based phoneme classifier and a low-rank nonnegative matrix factorization framework as examples of MLSE-based approaches. A listening experiment and instrumental measures confirm that, while super-Gaussian priors yield only moderate improvements for classic enhancement schemes, they make an important difference for MLSE-based approaches and significantly outperform Gaussian priors.
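
The limitation of Gaussian estimators mentioned above can be illustrated with a minimal sketch. It uses the standard Wiener gain G = xi / (1 + xi), where xi is the a priori SNR. All numeric values and variable names are hypothetical and chosen only for illustration; they are not taken from the paper. Because an envelope-only model assigns the same speech power to a harmonic bin and to the gap next to it, the Wiener gain is identical in both bins, no matter how different the observed noisy power is.

```python
def wiener_gain(xi):
    """Wiener gain for a priori SNR xi (linear scale)."""
    return xi / (1.0 + xi)

# Hypothetical two-bin example: [harmonic bin, bin between two harmonics].
# An envelope-only speech model cannot resolve harmonics, so it assigns
# the same speech power to both bins.
speech_psd_envelope = [4.0, 4.0]
noise_psd = [1.0, 1.0]
noisy_periodogram = [5.0, 1.0]  # observed noisy power differs strongly

# A priori SNR (from the model) and a posteriori SNR (from the observation).
xi = [s / n for s, n in zip(speech_psd_envelope, noise_psd)]
gamma = [y / n for y, n in zip(noisy_periodogram, noise_psd)]

# The Wiener gain uses only xi, so both bins receive the same gain.
gains = [wiener_gain(x) for x in xi]
print("a posteriori SNR:", gamma)
print("Wiener gains:   ", gains)
```

In this sketch the a posteriori SNR `gamma` is computed but never used by the Wiener gain. Gain functions derived under super-Gaussian priors typically depend on the noisy observation as well, which is what would let them attenuate the inter-harmonic bin more strongly.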
