Performance analysis of various training targets for improving speech quality and intelligibility

Abstract Denoising single-channel speech (recorded with a single microphone) remains an open problem in many speech-related applications. Supervised deep learning methods have recently been used to denoise speech signals. This work uses a Deep Neural Network (DNN) to learn the Time–Frequency (T-F) mask of clean speech from noisy speech features. In general, the Ideal Binary Mask (IBM) is used as a binary training target to improve speech intelligibility, and the Ideal Ratio Mask (IRM) is used as a non-binary training target to improve speech quality. However, neither is necessarily the best T-F mask for improving both speech quality and intelligibility, and the most appropriate training target for supervised deep learning methods remains unclear. In this work, a novel non-binary soft T-F mask named the Optimum Soft Mask (OSM) is proposed, analyzed, and compared with the different T-F mask types used in single-channel speech denoising methods. In addition, the target T-F mask is compared with existing state-of-the-art approaches to show the clear performance advantage of supervised deep learning models. The performance of the binary and non-binary DNN training targets is evaluated under different Signal-to-Noise Ratios (SNRs) and noise conditions to assess improvements in speech quality and intelligibility. The experimental results reveal that the binary IBM yields a significant improvement in speech intelligibility, while the non-binary IRM yields a substantial improvement in speech quality. At the same time, the proposed soft T-F mask shows notable improvement in both quality and intelligibility under various test conditions.
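The IBM and IRM training targets mentioned above have standard definitions in the speech separation literature: the IBM assigns 1 to a T-F unit when its local SNR exceeds a threshold (the local criterion, LC), while the IRM is a soft ratio of clean speech energy to mixture energy. The sketch below illustrates these two standard definitions over magnitude spectrograms; the function names, the small epsilon for numerical safety, and the default parameter values (LC = 0 dB, compression exponent β = 0.5) are illustrative choices, not taken from this paper, and the proposed OSM is not reproduced here since its formula is specific to this work.

```python
import numpy as np

def ideal_binary_mask(clean_mag, noise_mag, lc_db=0.0):
    """IBM: 1 where the local SNR of a T-F unit exceeds lc_db, else 0."""
    # Local SNR per T-F unit, in dB (epsilon guards against log/divide by zero)
    snr_db = 20.0 * np.log10(clean_mag / (noise_mag + 1e-10) + 1e-10)
    return (snr_db > lc_db).astype(np.float32)

def ideal_ratio_mask(clean_mag, noise_mag, beta=0.5):
    """IRM: soft ratio of clean energy to mixture energy, compressed by beta."""
    return (clean_mag**2 / (clean_mag**2 + noise_mag**2 + 1e-10)) ** beta
```

In a supervised setup, masks like these are computed from the known clean and noise signals of the training mixtures and serve as the DNN's regression or classification targets; at test time, the estimated mask is applied to the noisy spectrogram before resynthesis.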
