Study of GANs for Noisy Speech Simulation from Clean Speech

The performance of speech processing models trained on clean speech drops significantly in noisy conditions. Training with noisy datasets alleviates the problem, but procuring such datasets is not always feasible. Noisy speech simulation models that generate noisy speech from clean speech help remedy this issue. In this work, we study the ability of Generative Adversarial Networks (GANs) to simulate a variety of noises. We consider noise from three categories: Ultra-High-Frequency/Very-High-Frequency (UHF/VHF) channel noise, additive stationary and non-stationary noise, and codec distortion. We propose four GANs: the non-parallel translators SpeechAttentionGAN, SimuGAN, and MaskCycleGAN-Augment, and the parallel translator Speech2Speech-Augment. After training on small datasets of about 3 minutes of audio, we achieve improvements of 55.8%, 28.9%, and 22.8% in Multi-Scale Spectral Loss (MSSL) over the baseline on the RATS, TIMIT-Cabin, and TIMIT-Helicopter datasets, respectively.
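
The abstract reports results in terms of Multi-Scale Spectral Loss (MSSL) but does not spell out its exact formulation here. A common definition, in the style of the DDSP multi-resolution spectral loss, compares magnitude and log-magnitude spectrograms of the real and simulated noisy speech at several STFT resolutions. The sketch below is only an illustration of that idea; the FFT sizes, hop lengths, and equal weighting of the linear and log terms are assumptions, not necessarily the authors' exact settings.

```python
import numpy as np
import librosa


def multi_scale_spectral_loss(reference, generated, fft_sizes=(2048, 1024, 512, 256, 128, 64), eps=1e-7):
    """Illustrative MSSL: L1 distance between magnitude and log-magnitude
    spectrograms at several STFT resolutions, averaged over the scales.

    `reference` and `generated` are 1-D waveforms of equal length at the
    same sampling rate (e.g. real noisy speech vs. GAN-simulated speech).
    """
    loss = 0.0
    for n_fft in fft_sizes:
        hop = n_fft // 4  # assumed 75% overlap at every scale
        ref_mag = np.abs(librosa.stft(reference, n_fft=n_fft, hop_length=hop))
        gen_mag = np.abs(librosa.stft(generated, n_fft=n_fft, hop_length=hop))
        # linear-magnitude term
        loss += np.mean(np.abs(ref_mag - gen_mag))
        # log-magnitude term (eps avoids log(0))
        loss += np.mean(np.abs(np.log(ref_mag + eps) - np.log(gen_mag + eps)))
    return loss / len(fft_sizes)
```

Under this reading, a lower MSSL between simulated and real noisy recordings indicates that the GAN reproduces the target noise characteristics more faithfully, which is how the percentage improvements over the baseline would be computed.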
