Speech Enhancement Under Low SNR Conditions Via Noise Estimation Using Sparse and Low-Rank NMF with Kullback–Leibler Divergence

A key stage in speech enhancement is noise estimation, which usually requires prior models for speech, noise, or both. However, such prior models can be difficult to obtain. In this paper, sparse and low-rank nonnegative matrix factorization (NMF) with Kullback–Leibler divergence is proposed for noise and speech estimation without any prior knowledge of speech or noise: the input noisy magnitude spectrogram is decomposed into a low-rank noise part and a sparse speech-like part. This initial unsupervised speech-noise estimate is then used to set up a subsequent regularized NMF or convolutional NMF that reconstructs the noise and speech spectrograms, either by estimating a speech dictionary on the fly (the unsupervised approaches) or by using a speech dictionary pre-trained on utterances from disjoint speakers (the semi-supervised approaches). Information fusion was investigated by taking the geometric mean of the outputs of multiple enhancement algorithms. The performance of the algorithms was evaluated on five metrics (PESQ, SDR, SNR, STOI, and OVERALL) in experiments on TIMIT with 15 noise types. Under low input SNR conditions, the geometric means of the proposed unsupervised approaches outperformed spectral subtraction (SS) and minimum mean square estimation (MMSE). All the proposed semi-supervised approaches outperformed SS and MMSE, and under low SNR conditions they also performed better than state-of-the-art algorithms that use a prior noise or speech dictionary.
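As a rough illustration of two building blocks named in the abstract, the sketch below shows plain NMF under the generalized Kullback–Leibler divergence, using the standard multiplicative updates, and an element-wise geometric mean for fusing several spectrogram estimates. It is not the sparse and low-rank formulation proposed in the paper, and it omits the regularized and convolutional variants. The function names, the `eps` smoothing constant, and the assumption that fusion is applied element-wise to magnitude spectrograms are all illustrative choices rather than details taken from the paper.

```python
import numpy as np


def kl_divergence(V, V_hat, eps=1e-12):
    """Generalized KL divergence D(V || V_hat) between nonnegative arrays."""
    V = V + eps
    V_hat = V_hat + eps
    return float(np.sum(V * np.log(V / V_hat) - V + V_hat))


def kl_nmf(V, rank, n_iter=200, eps=1e-12, seed=0):
    """Plain KL-NMF with multiplicative updates, so that V ~ W @ H.

    V is a nonnegative (frequency x frames) magnitude spectrogram.
    This is a generic baseline, not the paper's sparse + low-rank model.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps   # dictionary (spectral bases)
    H = rng.random((rank, n_frames)) + eps  # activations over time
    for _ in range(n_iter):
        WH = W @ H + eps
        # H <- H * (W^T (V / WH)) / (W^T 1)
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        # W <- W * ((V / WH) H^T) / (1 H^T)
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H


def geometric_mean_fusion(estimates, eps=1e-12):
    """Element-wise geometric mean of several nonnegative estimates.

    Assumes every estimate has the same shape (hypothetical fusion setup).
    """
    stacked = np.stack(estimates, axis=0)
    return np.exp(np.mean(np.log(stacked + eps), axis=0))
```

Computing the geometric mean in the log domain avoids overflow when many estimates are multiplied together, and the small `eps` keeps the logarithm finite where an estimate is exactly zero.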
