Supervised Single Channel Speech Enhancement Based on Dual-Tree Complex Wavelet Transforms and Nonnegative Matrix Factorization Using the Joint Learning Process and Subband Smooth Ratio Mask

In this paper, we propose a novel speech enhancement method based on the dual-tree complex wavelet transform (DTCWT) and nonnegative matrix factorization (NMF) that exploits a subband smooth ratio mask (ssRM) through a joint learning process. The discrete wavelet packet transform (DWPT) lacks shift invariance because of the downsampling that follows the filtering process, which leaves significant noise in the reconstructed signal. The stationary wavelet transform (SWT) solves this shift invariance problem, but it is redundant. We therefore use the efficient DTCWT, which has the shift invariance property with limited redundancy, and calculate the ratio masks (RMs) between the clean training speech and the noisy speech (i.e., training noise mixed with clean speech). We also compute RMs between the noise and the noisy speech, and then learn both RMs jointly with their corresponding clean training speech and noise. Auto-regressive moving average (ARMA) filtering is applied to the previously generated matrices before NMF to obtain a smooth decomposition. The ssRM is proposed to exploit the advantage of jointly using the standard ratio mask (sRM) and the square root ratio mask (srRM). In short, the DTCWT produces a set of subband signals from the time-domain signal. A framing scheme is then applied to each subband signal to form matrices, and the RMs are calculated before being concatenated with these matrices. The ARMA filter is applied to the nonnegative matrix, which is formed by taking absolute values. Through the ssRM, speech components are detected using NMF in each newly formed matrix. Finally, the enhanced speech signal is obtained via the inverse DTCWT (IDTCWT). Performance is evaluated on the IEEE corpus and the GRID audio-visual corpus with different types of noise. The proposed approach significantly improves objective speech quality and intelligibility and outperforms the conventional STFT-NMF, DWPT-NMF, and DNN-IRM methods.
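The framing, NMF, and masking stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the DTCWT subband split, the ARMA smoothing, the joint learning of RMs, and the ssRM itself are omitted, and a single synthetic signal stands in for one subband. The function names (`frame_matrix`, `nmf`, `enhance_frames`), the frame length, the hop size, the NMF rank, and the generic ratio mask S/(S+N) are all assumptions made for illustration. The NMF uses the standard multiplicative updates for the Euclidean cost.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero


def frame_matrix(x, frame_len, hop):
    """Stack overlapping frames of a 1-D signal as the columns of a matrix."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[:, None] + hop * np.arange(n_frames)[None, :]
    return x[idx]


def nmf(V, rank=None, W=None, n_iter=200, seed=0):
    """NMF by multiplicative updates (Euclidean cost).

    If W is supplied it is kept fixed and only the activations H are updated.
    """
    rng = np.random.default_rng(seed)
    learn_W = W is None
    if learn_W:
        W = rng.random((V.shape[0], rank)) + EPS
    H = rng.random((W.shape[1], V.shape[1])) + EPS
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + EPS)
        if learn_W:
            W *= (V @ H.T) / (W @ H @ H.T + EPS)
    return W, H


def enhance_frames(noisy, W_speech, W_noise, n_iter=200):
    """Apply an NMF-derived ratio mask to a signed noisy frame matrix."""
    W = np.hstack([W_speech, W_noise])          # fixed, pre-trained bases
    _, H = nmf(np.abs(noisy), W=W, n_iter=n_iter)
    k = W_speech.shape[1]
    S_hat = W_speech @ H[:k]                    # speech magnitude estimate
    N_hat = W_noise @ H[k:]                     # noise magnitude estimate
    mask = S_hat / (S_hat + N_hat + EPS)        # generic ratio mask in [0, 1]
    return mask * noisy, mask


# Synthetic stand-ins for one subband of training speech and training noise.
rng = np.random.default_rng(1)
t = np.arange(8000) / 8000.0
speech = np.sin(2 * np.pi * 220 * t) * (1 + np.sin(2 * np.pi * 3 * t))
noise = 0.5 * rng.standard_normal(t.size)

# Training: learn separate bases from the absolute-valued frame matrices.
W_s, _ = nmf(np.abs(frame_matrix(speech, 64, 32)), rank=8)
W_n, _ = nmf(np.abs(frame_matrix(noise, 64, 32)), rank=8)

# Enhancement: decompose the noisy frames on the fixed bases and mask them.
enhanced, mask = enhance_frames(frame_matrix(speech + noise, 64, 32), W_s, W_n)
```

In a full pipeline, the masked frame matrix of each subband would be converted back to a subband signal by overlap-add before the inverse transform; that step is left out here.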
