Supervised single-channel speech enhancement using ratio mask with joint dictionary learning

This paper proposes a novel structure for single-channel speech enhancement that combines the advantages of the ratio mask (RM) and joint dictionary learning (JDL). The structure makes full use of the training data and overcomes some shortcomings of the generative dictionary learning (GDL) algorithm. RMs of the speech and the interferer are introduced to provide discriminative information in both the training and enhancement stages. In the training stage, the signals and their corresponding ideal RMs (IRMs) are used to learn the signal and IRM dictionaries jointly with the K-SVD algorithm. In the enhancement stage, the mixture signal and the mixture RM are sparsely represented over composite dictionaries built from the learned signal and IRM dictionaries, formulating a joint sparse coding (JSC) problem. The estimated RMs (ERMs) of the speech and the interferer in the mixture are then calculated to construct two soft mask (SM) filters. The proposed SM filters incorporate the ideal binary mask technique and a Wiener-type filter to exploit the discriminative information provided by the ERMs, strengthening the speech while suppressing the interferer in the mixture. The proposed algorithms improve both speech intelligibility and quality. Experimental evaluations verify that they achieve performance comparable to a deep neural network (DNN) based mask estimator at lower computational cost and outperform the other tested algorithms.
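
To make the described pipeline concrete, the Python sketch below illustrates the two stages under stated assumptions; it is not the authors' implementation. Magnitude-spectrogram frames are assumed as features, scikit-learn's DictionaryLearning stands in for K-SVD, orthogonal_mp stands in for the joint sparse coding step (here only the signal rows of the mixture are coded, whereas the paper also represents the mixture RM), and the final combination of the ERMs is an illustrative Wiener-type soft mask with a binary-mask style attenuation rather than the exact SM filters of the paper.

    # Hypothetical sketch of the RM + joint dictionary learning pipeline (assumptions noted above).
    import numpy as np
    from sklearn.decomposition import DictionaryLearning
    from sklearn.linear_model import orthogonal_mp


    def train_joint_dictionary(spec, irm, n_atoms=64, sparsity=5):
        """Jointly learn a signal dictionary and an IRM dictionary for one source.

        spec: (F, N) magnitude-spectrogram frames of the source (speech or interferer)
        irm:  (F, N) corresponding ideal ratio masks
        """
        stacked = np.vstack([spec, irm])            # (2F, N): signal rows stacked over IRM rows
        learner = DictionaryLearning(
            n_components=n_atoms,
            transform_algorithm="omp",
            transform_n_nonzero_coefs=sparsity,
            fit_algorithm="cd",
        )
        learner.fit(stacked.T)                      # scikit-learn expects samples in rows
        D = learner.components_.T                   # (2F, n_atoms) joint dictionary
        F = spec.shape[0]
        return D[:F], D[F:]                         # signal part, IRM part


    def enhance(mix_spec, D_sig_s, D_irm_s, D_sig_n, D_irm_n, sparsity=5, eps=1e-8):
        """Sparse-code the mixture over the composite dictionary, read off the ERMs,
        and apply a Wiener-type soft mask.

        mix_spec: (F, N) mixture magnitude-spectrogram frames
        """
        D_sig = np.hstack([D_sig_s, D_sig_n])       # composite signal dictionary
        K_s = D_sig_s.shape[1]
        # Simplified joint sparse coding: code the mixture frames over the composite dictionary.
        A = orthogonal_mp(D_sig, mix_spec, n_nonzero_coefs=sparsity)
        # Estimated ratio masks (ERMs) reconstructed from the IRM sub-dictionaries.
        erm_s = np.clip(D_irm_s @ A[:K_s], 0.0, 1.0)
        erm_n = np.clip(D_irm_n @ A[K_s:], 0.0, 1.0)
        # Wiener-type soft mask from the two ERMs, with a binary-mask style attenuation
        # of time-frequency units the interferer dominates (the factor is illustrative).
        sm = erm_s ** 2 / (erm_s ** 2 + erm_n ** 2 + eps)
        sm[erm_n > erm_s] *= 0.1
        return sm * mix_spec                        # enhanced magnitude spectrogram

In this sketch the speech and interferer joint dictionaries would be learned separately with train_joint_dictionary on their respective training data and concatenated only at enhancement time, mirroring the composite-dictionary construction described in the abstract.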
