Monaural speech separation using GA-DNN integration scheme

Abstract In this work, we propose a model based on a Genetic Algorithm (GA) and a Deep Neural Network (DNN) to enhance the quality and intelligibility of noisy speech. In the proposed model, the Voiced Speech (VS) T-F mask is computed using the correlogram, frame energy and cross-channel correlogram, while the Unvoiced Speech (UVS) T-F mask is derived from speech onset/offset analysis. The T-F mask obtained from speech onsets and offsets covers both the voiced and unvoiced segments of the noisy speech signal, so the UVS T-F mask is obtained by subtracting the VS mask from it. Next, the GA is used to find the optimum weight for combining the VS and UVS T-F masks so as to improve speech quality and intelligibility. However, a weight obtained by the GA for one set of training samples may not be optimal for all combinations of speech and noise. To address this issue, we propose a DNN model that estimates the optimum weight for any given speech and noise condition. The DNN is trained on acoustic features together with the optimum weights obtained by the GA; the trained model is then used to estimate the optimum weight for unseen test speech and noise samples. The performance of the proposed GA-DNN model is evaluated using objective and subjective quality and intelligibility measures. The results show a clear improvement in speech quality and intelligibility, with average gains of 0.73, 4.07, 0.17, 0.26 and 0.22 in PESQ, SNR, STOI, CSII and NCM, respectively, compared with existing speech separation systems.
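The GA step described above can be sketched as a one-dimensional search for the mask-combination weight. The sketch below is a minimal, illustrative implementation only: the weighted-sum combination rule, the SNR-style fitness, and all GA hyperparameters (population size, truncation selection, blend crossover, Gaussian mutation) are assumptions for the example, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def combine_masks(vs_mask, uvs_mask, w):
    # Weighted combination of the voiced and unvoiced T-F masks
    # (assumed linear form; the paper's exact rule may differ).
    return np.clip(w * vs_mask + (1.0 - w) * uvs_mask, 0.0, 1.0)

def snr_fitness(w, vs_mask, uvs_mask, noisy_tf, clean_tf):
    # Toy fitness: SNR between the clean T-F magnitudes and the
    # masked noisy T-F magnitudes for a candidate weight w.
    est = combine_masks(vs_mask, uvs_mask, w) * noisy_tf
    noise = clean_tf - est
    return 10.0 * np.log10(np.sum(clean_tf**2) / (np.sum(noise**2) + 1e-12))

def ga_optimize_weight(fitness, pop_size=20, generations=30, mut_sigma=0.05):
    # Minimal real-coded GA over a single weight in [0, 1]:
    # truncation selection, blend crossover, Gaussian mutation.
    pop = rng.uniform(0.0, 1.0, pop_size)
    for _ in range(generations):
        scores = np.array([fitness(w) for w in pop])
        elite = pop[np.argsort(scores)[-pop_size // 2:]]    # keep best half
        parents = rng.choice(elite, size=(pop_size, 2))     # random pairings
        alpha = rng.uniform(0.0, 1.0, pop_size)
        children = alpha * parents[:, 0] + (1 - alpha) * parents[:, 1]
        children += rng.normal(0.0, mut_sigma, pop_size)    # mutation
        pop = np.clip(children, 0.0, 1.0)
    scores = np.array([fitness(w) for w in pop])
    return pop[np.argmax(scores)]
```

In the full system, the weight returned by `ga_optimize_weight` for each training mixture would serve as the regression target for the DNN, which then predicts the weight directly from acoustic features at test time.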
