SEGAN: Speech Enhancement Generative Adversarial Network

Current speech enhancement techniques operate on the spectral domain and/or exploit some higher-level feature. The majority of them tackle a limited number of noise conditions and rely on first-order statistics. To circumvent these issues, deep networks are being increasingly used, thanks to their ability to learn complex functions from large example sets. In this work, we propose the use of generative adversarial networks for speech enhancement. In contrast to current techniques, we operate at the waveform level, training the model end-to-end, and incorporate 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them. We evaluate the proposed model using an independent, unseen test set with two speakers and 20 alternative noise conditions. The enhanced samples confirm the viability of the proposed model, and both objective and subjective evaluations confirm the effectiveness of it. With that, we open the exploration of generative architectures for speech enhancement, which may progressively incorporate further speech-centric design choices to improve their performance.

[1]  References , 1971 .

[2]  Alan V. Oppenheim,et al.  All-pole modeling of degraded speech , 1978 .

[3]  Richard M. Schwartz,et al.  Enhancement of speech corrupted by acoustic noise , 1979, ICASSP.

[4]  Jae S. Lim,et al.  The unimportance of phase in speech enhancement , 1982 .

[5]  Alex Waibel,et al.  Noise reduction using connectionist models , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[6]  George Carayannis,et al.  Speech enhancement from noise: A regenerative approach , 1991, Speech Commun..

[7]  Yariv Ephraim,et al.  Statistical-model-based speech enhancement systems , 1992, Proc. IEEE.

[8]  Yariv Ephraim,et al.  A signal subspace approach for speech enhancement , 1995, IEEE Trans. Speech Audio Process..

[9]  Schuyler Quackenbush,et al.  Objective measures of speech quality , 1995 .

[10]  Pascal Scalart,et al.  Speech enhancement based on a priori signal to noise estimation , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[11]  Javier Ortega-Garcia,et al.  Overview of speech enhancement techniques for automatic speaker recognition , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[12]  Phil D. Green,et al.  Speech enhancement with missing data techniques using recurrent neural networks , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[13]  Q. Fu,et al.  Spectral subtraction-based speech enhancement for cochlear implant patients in background noise. , 2005, The Journal of the Acoustical Society of America.

[14]  Philipos C. Loizou,et al.  Speech Enhancement: Theory and Practice , 2007 .

[15]  Yifan Gong,et al.  A minimum-mean-square-error noise reduction algorithm on Mel-frequency cepstra for robust speech recognition , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[16]  Yi Hu,et al.  Evaluation of Objective Quality Measures for Speech Enhancement , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[17]  Kuldip K. Paliwal,et al.  The importance of phase in speech enhancement , 2011, Speech Commun..

[18]  Quoc V. Le,et al.  Recurrent Neural Networks for Noise Reduction in Robust ASR , 2012, INTERSPEECH.

[19]  Simon King,et al.  The voice bank corpus: Design, collection and data analysis of a large regional accent speech database , 2013, 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE).

[20]  Yu Tsao,et al.  Speech enhancement based on deep denoising autoencoder , 2013, INTERSPEECH.

[21]  Nobutaka Ito,et al.  The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings , 2013 .

[22]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[23]  Björn W. Schuller,et al.  Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR , 2015, LVA/ICA.

[24]  Li-Rong Dai,et al.  A Regression Approach to Speech Enhancement Based on Deep Neural Networks , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[25]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[26]  Junichi Yamagishi,et al.  Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech , 2016, SSW.

[27]  Xin Wang,et al.  Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech , 2016 .

[28]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Wojciech Zaremba,et al.  Improved Techniques for Training GANs , 2016, NIPS.

[30]  Alexei A. Efros,et al.  Context Encoders: Feature Learning by Inpainting , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Soumith Chintala,et al.  Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks , 2015, ICLR.

[32]  Martín Abadi,et al.  TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[33]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[34]  Raymond Y. K. Lau,et al.  Least Squares Generative Adversarial Networks , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[35]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).