Speech Enhancement Using Generative Dictionary Learning

The enhancement of speech degraded by real-world interferers is a highly relevant and difficult task. Its importance arises from the multitude of practical applications, whereas the difficulty is due to the fact that interferers are often nonstationary and potentially similar to speech. The goal of monaural speech enhancement is to separate a single mixture into its underlying clean speech and interferer components. This under-determined problem is solved by incorporating prior knowledge in the form of learned speech and interferer dictionaries. The clean speech is recovered from the degraded speech by sparse coding of the mixture in a composite dictionary consisting of the concatenation of a speech and an interferer dictionary. Enhancement performance is measured using objective measures and is limited by two effects. A too sparse coding of the mixture causes the speech component to be explained with too few speech dictionary atoms, which induces an approximation error we denote source distortion. However, a too dense coding of the mixture results in source confusion, where parts of the speech component are explained by interferer dictionary atoms and vice-versa. Our method enables the control of the source distortion and source confusion trade-off, and therefore achieves superior performance compared to powerful approaches like geometric spectral subtraction and codebook-based filtering, for a number of challenging interferer classes such as speech babble and wind noise.

[1]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[2]  J. Eggert,et al.  Sparse coding and NMF , 2004, 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541).

[3]  S. Boll,et al.  Suppression of acoustic noise in speech using spectral subtraction , 1979 .

[4]  H. Sebastian Seung,et al.  Learning the parts of objects by non-negative matrix factorization , 1999, Nature.

[5]  Bhiksha Raj,et al.  Speech denoising using nonnegative matrix factorization with priors , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[6]  Xiaoming Huo,et al.  Uncertainty principles and ideal atomic decomposition , 2001, IEEE Trans. Inf. Theory.

[7]  Patrik O. Hoyer,et al.  Non-negative Matrix Factorization with Sparseness Constraints , 2004, J. Mach. Learn. Res..

[8]  W. Bastiaan Kleijn,et al.  Codebook driven short-term predictor parameter estimation for speech enhancement , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[9]  Bhiksha Raj,et al.  Supervised and Semi-supervised Separation of Sounds from Single-Channel Mixtures , 2007, ICA.

[10]  Yang Lu,et al.  A geometric approach to spectral subtraction , 2008, Speech Commun..

[11]  Christian Jutten,et al.  Parametric dictionary learning using steepest descent , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[12]  Yi Hu,et al.  Objective measures for predicting speech intelligibility in noisy conditions based on new band-importance functions. , 2009, The Journal of the Acoustical Society of America.

[13]  Guillermo Sapiro,et al.  Online Learning for Matrix Factorization and Sparse Coding , 2009, J. Mach. Learn. Res..

[14]  Michael A. Saunders,et al.  Atomic Decomposition by Basis Pursuit , 1998, SIAM J. Sci. Comput..

[15]  I. Johnstone,et al.  Ideal spatial adaptation by wavelet shrinkage , 1994 .

[16]  Rajat Raina,et al.  Efficient sparse coding algorithms , 2006, NIPS.

[17]  Philipos C. Loizou,et al.  Improving Speech Intelligibility in Noise Using Environment-Optimized Algorithms , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[18]  Joachim M. Buhmann,et al.  Speech enhancement with sparse coding in learned dictionaries , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[19]  S. Mallat,et al.  Adaptive time-frequency decomposition with matching pursuits , 1992, [1992] Proceedings of the IEEE-SP International Symposium on Time-Frequency and Time-Scale Analysis.

[20]  R. Tibshirani,et al.  Regression shrinkage and selection via the lasso: a retrospective , 2011 .

[21]  Jon Barker,et al.  An audio-visual corpus for speech perception and automatic speech recognition. , 2006, The Journal of the Acoustical Society of America.

[22]  Le Li,et al.  SENSC: a Stable and Efficient Algorithm for Nonnegative Sparse Coding: SENSC: a Stable and Efficient Algorithm for Nonnegative Sparse Coding , 2009 .

[23]  Stéphane Mallat,et al.  A Wavelet Tour of Signal Processing - The Sparse Way, 3rd Edition , 2008 .

[24]  W. Bastiaan Kleijn,et al.  Codebook-Based Bayesian Speech Enhancement for Nonstationary Environments , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[25]  M. Elad,et al.  $rm K$-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation , 2006, IEEE Transactions on Signal Processing.

[26]  John Princen,et al.  Analysis/Synthesis filter bank design based on time domain aliasing cancellation , 1986, IEEE Trans. Acoust. Speech Signal Process..

[27]  Peter Bühlmann Regression shrinkage and selection via the Lasso: a retrospective (Robert Tibshirani): Comments on the presentation , 2011 .

[28]  Andries P. Hekstra,et al.  Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[29]  R. Tibshirani,et al.  Least angle regression , 2004, math/0406456.

[30]  Tim Fingscheidt,et al.  Environment-Optimized Speech Enhancement , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[31]  J. Larsen,et al.  Reduction of non-stationary noise using a non-negative latent variable decomposition , 2008, 2008 IEEE Workshop on Machine Learning for Signal Processing.

[32]  Philipos C. Loizou,et al.  Speech Enhancement: Theory and Practice , 2007 .

[33]  Stéphane Mallat,et al.  Matching pursuits with time-frequency dictionaries , 1993, IEEE Trans. Signal Process..

[34]  Sam T. Roweis,et al.  One Microphone Source Separation , 2000, NIPS.

[35]  Michael Elad,et al.  On the stability of the basis pursuit in the presence of noise , 2006, Signal Process..

[36]  D. L. Donoho,et al.  Ideal spacial adaptation via wavelet shrinkage , 1994 .

[37]  Michael Elad,et al.  Dictionaries for Sparse Representation Modeling , 2010, Proceedings of the IEEE.

[38]  Rainer Martin,et al.  Noise power spectral density estimation based on optimal smoothing and minimum statistics , 2001, IEEE Trans. Speech Audio Process..

[39]  S. Mallat,et al.  Adaptive greedy approximations , 1997 .

[40]  Michael Elad,et al.  Efficient Implementation of the K-SVD Algorithm using Batch Orthogonal Matching Pursuit , 2008 .

[41]  Daniel P. W. Ellis,et al.  Model-Based Monaural Source Separation Using a Vector-Quantized Phase-Vocoder Representation , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[42]  Ronald E. Crochiere,et al.  A study of complexity and quality of speech waveform coders , 1978, ICASSP.

[43]  Zhifeng Zhang,et al.  Adaptive time-frequency decompositions , 1994 .

[44]  Pierre Vandergheynst,et al.  Compressed Sensing and Redundant Dictionaries , 2007, IEEE Transactions on Information Theory.

[45]  Yariv Ephraim,et al.  A signal subspace approach for speech enhancement , 1995, IEEE Trans. Speech Audio Process..

[46]  A. Bruckstein,et al.  K-SVD : An Algorithm for Designing of Overcomplete Dictionaries for Sparse Representation , 2005 .