Multi-scale Generative Adversarial Networks for Speech Enhancement

Generative adversarial networks (GANs) can learn, after extensive training, to recognize and remove noise from noisy speech. The most representative model is the Speech Enhancement Generative Adversarial Network (SEGAN). However, removing noise without distorting the speech remains challenging, especially in environments with a low signal-to-noise ratio (SNR). To address this problem, this paper proposes Speech Enhancement Multi-scale Generative Adversarial Networks (SEMGAN), whose generator and discriminator are built on fully convolutional neural networks (FCNNs). Compared with SEGAN, the generator produces speech in three different dimensions, and the discriminator makes multiple judgments. In addition, multiple noise types and SNRs are used to train the model to improve its generalization capability. For the testing stage, we further propose pre-SEMGAN, which solves the problem that the last frame of speech data was not processed well. Experimental results indicate that the proposed architectures (SEMGAN and pre-SEMGAN) outperform the optimally modified log-spectral amplitude estimator (OMLSA) and SEGAN under different noisy conditions. Notably, at 2.5 dB SNR, SEMGAN's PESQ and STOI scores are about 7% and 3.6% higher than SEGAN's, respectively.
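
Training on multiple noise types and SNRs requires noisy mixtures at controlled SNR values. The following is a minimal sketch of one common way to build such a mixture, not the authors' data pipeline: the function names `mix_at_snr` and `measured_snr_db`, the 16 kHz sampling rate, the synthetic sine "speech", and the white-noise source are all assumptions made for illustration.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so that the mixture has the requested SNR (dB).

    Hypothetical helper for illustration; assumes both signals are 1-D,
    non-empty, and have non-zero power.
    """
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Tile or crop the noise so that it covers the whole utterance.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)

    # Solve snr_db = 10 * log10(clean_power / (gain**2 * noise_power)) for gain.
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise


def measured_snr_db(clean, noisy):
    """SNR in dB of a mixture, treating (noisy - clean) as the noise."""
    residual = noisy - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))


# Synthetic stand-ins: a 440 Hz tone as "speech" and white noise as the noise.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(4000)

noisy = mix_at_snr(clean, noise, snr_db=2.5)
print(round(measured_snr_db(clean, noisy), 2))
```

Because the gain is computed from the power of the noise after it has been tiled and cropped, the measured SNR of the mixture matches the requested value up to floating-point error. Looping this helper over several noise recordings and SNR values would give a multi-condition training set of the kind the abstract describes.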
