Segmented Time-Frequency Masking Algorithm for Speech Separation Based on Deep Neural Networks

To address the residual background noise left by supervised single-channel speech separation algorithms in non-stationary noise environments, a piecewise (segmented) time-frequency masking target based on the Wiener filtering principle is proposed and used as the neural network's training target; it both tracks SNR changes and limits the damage to speech quality. Four features, relative spectral transform and perceptual linear prediction (RASTA-PLP), amplitude modulation spectrogram (AMS), Mel-frequency cepstral coefficients (MFCC), and Gammatone frequency cepstral coefficients (GFCC), are combined so that the extracted multi-level speech information serves as the network's training features, and a deep neural network (DNN) based speech separation system is then constructed to separate the noisy speech. Experimental results show that, compared with traditional time-frequency masking methods, the segmented time-frequency masking algorithm improves speech quality and intelligibility, suppressing noise and achieving better separation performance at low SNR.
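The abstract does not spell out the piecewise rule, so the following Python sketch is one plausible reading, not the paper's definition: a Wiener-style ratio mask applied between two local-SNR thresholds, with hard 0/1 decisions outside that band. The function name `segmented_tf_mask` and the thresholds `low_db`/`high_db` are illustrative assumptions.

```python
import numpy as np

def segmented_tf_mask(speech_power, noise_power,
                      low_db=-5.0, high_db=10.0):
    """Piecewise time-frequency mask built on the Wiener-filter
    ratio mask. speech_power and noise_power are (freq, time)
    arrays of per-unit power estimates.

    The thresholds low_db/high_db are illustrative, not taken
    from the paper: units with local SNR below low_db are zeroed
    (noise-dominated), units above high_db pass unchanged, and
    units in between get the soft Wiener gain, which tracks SNR
    changes while limiting damage to the speech.
    """
    eps = np.finfo(float).eps
    local_snr_db = 10.0 * np.log10((speech_power + eps) /
                                   (noise_power + eps))
    # Wiener-filter ratio mask: S / (S + N) per T-F unit.
    wiener = speech_power / (speech_power + noise_power + eps)
    return np.where(local_snr_db < low_db, 0.0,
                    np.where(local_snr_db > high_db, 1.0, wiener))
```

Likewise, a minimal sketch of how the fused features might drive a mask-estimating DNN, assuming the per-frame RASTA-PLP, AMS, MFCC, and GFCC streams are already produced by external toolboxes (not shown) and simply concatenated; the layer sizes and dimensions are assumptions, not the paper's configuration.

```python
import numpy as np
import torch.nn as nn

def stack_features(rasta_plp, ams, mfcc, gfcc):
    # Each stream is (n_features, n_frames); fuse along the
    # feature axis so every frame carries all four descriptions.
    return np.vstack([rasta_plp, ams, mfcc, gfcc])

# Illustrative mask estimator: fused per-frame features in, one
# mask value per frequency channel out, squashed to [0, 1].
n_input, n_channels = 246, 64   # assumed dimensions
model = nn.Sequential(
    nn.Linear(n_input, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, n_channels), nn.Sigmoid(),
)
loss_fn = nn.MSELoss()  # regress toward the segmented mask target
```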
