AeGAN: Time-Frequency Speech Denoising via Generative Adversarial Networks

Automatic speech recognition (ASR) systems are now central to everyday tasks such as speech-to-text processing and language translation, which creates the need for ASR systems that can operate reliably in realistic, crowded environments. Speech enhancement is therefore a valuable building block for ASR as well as for other applications such as hearing aids, smartphones, and teleconferencing systems. In this paper, a generative adversarial network (GAN) based framework is investigated for speech enhancement, specifically the denoising of audio tracks. A new architecture based on the CasNet generator and an additional feature-based loss are incorporated to preserve realistic speech phonetics in the denoised output. Finally, the proposed framework is shown to outperform both learning-based and traditional model-based speech enhancement approaches.
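To make the kind of training objective described above concrete, the minimal PyTorch sketch below shows how an adversarial loss can be combined with a feature-based loss computed on a discriminator's intermediate activations over time-frequency (spectrogram) inputs. This is an illustrative sketch under stated assumptions, not the paper's implementation: SimpleDenoiser, PatchDiscriminator, generator_loss, and lambda_feat are hypothetical placeholder names, and the CasNet generator architecture is not reproduced here.

```python
import torch
import torch.nn as nn

# Placeholder generator; the paper's CasNet generator is not reproduced here.
# Any 2-D encoder-decoder over magnitude spectrograms could fill this slot.
class SimpleDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, noisy_spec):
        return self.net(noisy_spec)

# Conditional discriminator that returns its intermediate feature maps
# followed by the final score map (used for the feature-based loss below).
class PatchDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(2, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2)),
            nn.Sequential(nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2)),
            nn.Conv2d(64, 1, 4, padding=1),
        ])

    def forward(self, noisy_spec, candidate_spec):
        # Condition on the noisy input, as in image-to-image translation GANs.
        x = torch.cat([noisy_spec, candidate_spec], dim=1)
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return feats

def generator_loss(disc, noisy, clean, denoised, lambda_feat=10.0):
    """Adversarial loss plus a feature-based loss: L1 distance between the
    discriminator's intermediate activations for denoised and clean speech."""
    fake_feats = disc(noisy, denoised)
    real_feats = disc(noisy, clean)
    adv = torch.mean((fake_feats[-1] - 1.0) ** 2)  # least-squares GAN objective
    feat = sum(nn.functional.l1_loss(f, r.detach())
               for f, r in zip(fake_feats[:-1], real_feats[:-1]))
    return adv + lambda_feat * feat
```

In this kind of setup, the feature-based term penalizes differences between denoised and clean spectrograms in the discriminator's learned feature space rather than only at the pixel level, which is one common way to encourage perceptually faithful speech structure; the exact losses and weighting used by the authors may differ.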
