CMGAN: Conformer-based Metric GAN for Speech Enhancement

The convolution-augmented transformer (Conformer) has recently achieved promising performance in automatic speech recognition (ASR) and time-domain speech enhancement (SE), as it can capture both local and global dependencies in the speech signal. In this paper, we propose a conformer-based metric generative adversarial network (CMGAN) for SE in the time-frequency (TF) domain. In the generator, we utilize two-stage conformer blocks to aggregate all magnitude and complex spectrogram information by modeling both time and frequency dependencies. The estimation of the magnitude and complex spectrograms is decoupled in the decoder stage and then jointly combined to reconstruct the enhanced speech. In addition, a metric discriminator further improves the quality of the enhanced speech by optimizing the generator with respect to a corresponding evaluation score. Quantitative analysis on the Voice Bank+DEMAND dataset shows that CMGAN outperforms various previous models by a clear margin, e.g., a PESQ of 3.41 and an SSNR of 11.10 dB.
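To make the two-stage design concrete, the sketch below illustrates the core idea in PyTorch: one Conformer block attends along the time axis and a second along the frequency axis of a TF representation. This is a minimal illustration under assumed shapes and hyperparameters, not the paper's actual implementation; all module names and dimensions here are placeholders.

```python
# Minimal sketch of a two-stage (time, then frequency) Conformer block.
# Assumptions: input is a TF feature map of shape (batch, time, freq, channels);
# the Conformer block is the canonical FFN -> MHSA -> conv module -> FFN sandwich.
import torch
import torch.nn as nn


class ConformerBlock(nn.Module):
    """Simplified Conformer block: half-step FFN, self-attention, conv module, half-step FFN."""

    def __init__(self, dim: int, heads: int = 4, kernel_size: int = 31):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                  nn.SiLU(), nn.Linear(4 * dim, dim))
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, 2 * dim, 1), nn.GLU(dim=1),            # pointwise + gating
            nn.Conv1d(dim, dim, kernel_size,                       # depthwise conv for
                      padding=kernel_size // 2, groups=dim),       # local dependencies
            nn.BatchNorm1d(dim), nn.SiLU(),
            nn.Conv1d(dim, dim, 1),
        )
        self.ffn2 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                  nn.SiLU(), nn.Linear(4 * dim, dim))
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:            # x: (batch, seq, dim)
        x = x + 0.5 * self.ffn1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]           # global dependencies
        c = self.conv_norm(x).transpose(1, 2)                       # (batch, dim, seq) for Conv1d
        x = x + self.conv(c).transpose(1, 2)
        x = x + 0.5 * self.ffn2(x)
        return self.out_norm(x)


class TwoStageConformer(nn.Module):
    """Apply one Conformer block along time, then a second along frequency."""

    def __init__(self, dim: int):
        super().__init__()
        self.time_block = ConformerBlock(dim)
        self.freq_block = ConformerBlock(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:             # x: (B, T, F, C)
        b, t, f, c = x.shape
        # Time stage: each frequency bin becomes an independent sequence over T.
        x = x.permute(0, 2, 1, 3).reshape(b * f, t, c)
        x = self.time_block(x)
        # Frequency stage: each frame becomes an independent sequence over F.
        x = x.reshape(b, f, t, c).permute(0, 2, 1, 3).reshape(b * t, f, c)
        x = self.freq_block(x)
        return x.reshape(b, t, f, c)
```

Treating every frequency bin as a time sequence (and vice versa) keeps attention cost linear in the other axis, which is the usual motivation for such dual-path designs.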

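The metric discriminator follows the MetricGAN recipe: rather than classifying real versus fake, it regresses a normalized evaluation score (here PESQ) from pairs of magnitude spectrograms, giving the generator a differentiable surrogate for the non-differentiable metric. Below is a minimal sketch of both loss terms; `metric_d` and the `pesq_scores` tensor (PESQ pre-normalized to [0, 1]) are hypothetical placeholders.

```python
# Sketch of metric-discriminator training in the spirit of MetricGAN.
# Assumptions: metric_d(reference_mag, test_mag) returns a predicted
# normalized PESQ in [0, 1]; pesq_scores holds the measured scores.
import torch
import torch.nn.functional as F


def discriminator_loss(metric_d, clean_mag, enhanced_mag, pesq_scores):
    """Train D to predict the metric: 1.0 for clean/clean pairs,
    the measured normalized PESQ for clean/enhanced pairs."""
    d_clean = metric_d(clean_mag, clean_mag)
    d_enh = metric_d(clean_mag, enhanced_mag.detach())   # no gradient into G here
    return (F.mse_loss(d_clean, torch.ones_like(d_clean))
            + F.mse_loss(d_enh, pesq_scores))


def generator_metric_loss(metric_d, clean_mag, enhanced_mag):
    """Adversarial term for G: push the predicted score of the
    enhanced speech toward the maximum normalized PESQ of 1."""
    d_enh = metric_d(clean_mag, enhanced_mag)
    return F.mse_loss(d_enh, torch.ones_like(d_enh))
```

In practice this metric term is combined with magnitude, complex-spectrogram, and time-domain reconstruction losses, so the discriminator only has to steer the generator toward perceptually better solutions rather than carry the whole objective.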