A Joint Learning Algorithm for Complex-Valued T-F Masks in Deep Learning-Based Single-Channel Speech Enhancement Systems

This paper presents a joint learning algorithm for complex-valued time-frequency (T-F) masks in single-channel speech enhancement systems. Most single-channel speech enhancement algorithms enhance only the magnitude component in the T-F domain and reuse the noisy input phase component without any processing. Consequently, the mismatch between the processed magnitude and the unprocessed phase degrades sound quality. To address this issue, learning methods that target a T-F mask defined in the complex domain have recently been proposed. However, because the complex-valued T-F mask has a wide dynamic range and an irregular spectrogram pattern, it is difficult to learn even with a large-scale deep learning network. Moreover, training that targets the T-F mask itself does not directly minimize distortion in the spectral or time domain. To address these concerns, we focus on three issues: 1) effective estimation of complex numbers with a wide dynamic range; 2) a learning method that is directly related to speech enhancement performance; and 3) a way to resolve the mismatch between the estimated magnitude and phase spectra. We propose objective functions that address each of these issues and train the network by minimizing them within a joint learning framework. Evaluation results demonstrate that the proposed learning algorithm achieves significant improvements in various objective measures and in a subjective preference listening test.
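
As a concrete illustration of issue 1), the sketch below shows how a complex-valued T-F mask can be formed and squashed into a bounded range before regression. This is a minimal NumPy sketch assuming the tanh-compressed complex ratio mask (cRM) formulation from prior work by Williamson et al.; the function names, the constants K and C, and the eps floor are illustrative choices, not this paper's implementation.

```python
import numpy as np

def complex_ratio_mask(S, Y, eps=1e-8):
    """Ideal complex ratio mask M = S / Y, computed per T-F bin.

    S, Y: complex STFTs of clean and noisy speech (freq x frames).
    The real and imaginary parts of M are unbounded, which is what
    makes direct regression of the mask hard to learn.
    """
    denom = np.abs(Y) ** 2 + eps
    m_real = (Y.real * S.real + Y.imag * S.imag) / denom
    m_imag = (Y.real * S.imag - Y.imag * S.real) / denom
    return m_real, m_imag

def compress(m, K=10.0, C=0.1):
    """Tanh-style compression K * (1 - exp(-C*m)) / (1 + exp(-C*m)),
    mapping an unbounded mask component into (-K, K) so a network
    with a bounded output activation can serve as the regressor.
    K = 10, C = 0.1 are illustrative values from the cRM literature."""
    return K * (1.0 - np.exp(-C * m)) / (1.0 + np.exp(-C * m))

def uncompress(mc, K=10.0, C=0.1):
    """Inverse of compress(); applied to network outputs at inference."""
    return (-1.0 / C) * np.log((K - mc) / (K + mc))

def apply_mask(m_real, m_imag, Y):
    """Enhanced STFT (M_real + j*M_imag) * Y: both the magnitude and
    the phase of the noisy spectrum are modified jointly."""
    return (m_real + 1j * m_imag) * Y
```

At inference time, the bounded network outputs are passed through uncompress() and applied to the noisy STFT, so magnitude and phase are enhanced together. Note that a loss defined only on the (compressed) mask does not directly penalize distortion in the reconstructed spectrum or waveform; the joint objectives proposed in this paper target exactly that gap.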
