Time-Frequency Masking-Based Speech Enhancement Using Generative Adversarial Network

The success of time-frequency (T-F) mask-based approaches depends on the accuracy of the mask predicted from the noisy spectral features. State-of-the-art methods in T-F masking-based enhancement employ Deep Neural Networks (DNNs) to predict the mask. Recently, Generative Adversarial Networks (GANs) have been gaining popularity as an alternative to maximum likelihood (ML)-based optimization of deep learning architectures. In this paper, we propose to exploit GANs in the T-F masking-based enhancement framework. We present a viable strategy for using a GAN in this application by modifying the existing approach: the network learns the mask implicitly while predicting the clean T-F representation. Moreover, we show that a vanilla GAN fails to predict an accurate mask, and we propose a regularized objective function that adds the Mean Square Error (MSE) between the predicted and target spectra to overcome this failure. Objective evaluation shows that the proposed method predicts the mask more accurately than state-of-the-art ML-based optimization techniques. The proposed system also significantly improves speech quality over a recent GAN-based speech enhancement system, while maintaining a better trade-off between low speech distortion and effective removal of the background interference present in the noisy mixture.
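
To make the regularized objective concrete, the sketch below shows one way such an MSE-regularized adversarial loss can be written. It is a minimal PyTorch-style sketch under stated assumptions, not the paper's implementation: the generator G, the conditional discriminator D, the sigmoid-masked output (reflecting the implicitly learned mask), and the weight lambda_mse are illustrative choices.

```python
# Minimal sketch (PyTorch) of a GAN objective regularized by the MSE
# between the predicted and target spectra. G, D, the sigmoid mask, and
# lambda_mse are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn.functional as F

def generator_loss(G, D, noisy_spec, clean_spec, lambda_mse=100.0):
    """Adversarial loss plus MSE regularizer on the enhanced spectrum."""
    mask = torch.sigmoid(G(noisy_spec))     # mask in [0, 1], learned implicitly
    enhanced = mask * noisy_spec            # predicted clean T-F representation
    logits = D(enhanced, noisy_spec)        # discriminator conditioned on input
    adv = F.binary_cross_entropy_with_logits(
        logits, torch.ones_like(logits))    # fool D: label enhanced as "real"
    mse = F.mse_loss(enhanced, clean_spec)  # spectral MSE regularizer
    return adv + lambda_mse * mse

def discriminator_loss(G, D, noisy_spec, clean_spec):
    """Standard conditional discriminator loss (clean vs. enhanced)."""
    with torch.no_grad():                   # no generator update in this step
        enhanced = torch.sigmoid(G(noisy_spec)) * noisy_spec
    real_logits = D(clean_spec, noisy_spec)
    fake_logits = D(enhanced, noisy_spec)
    real = F.binary_cross_entropy_with_logits(
        real_logits, torch.ones_like(real_logits))
    fake = F.binary_cross_entropy_with_logits(
        fake_logits, torch.zeros_like(fake_logits))
    return real + fake
```

Setting lambda_mse to zero recovers the vanilla GAN objective that, as noted above, fails to yield accurate masks; the MSE term anchors the adversarially trained generator to the target clean spectrum.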
