Cold Diffusion for Speech Enhancement

Diffusion models have recently shown promising results on difficult enhancement tasks such as conditional and unconditional restoration of natural images and audio signals. In this work, we explore whether a recently proposed iterative diffusion model, cold diffusion, can be used to recover clean speech from noisy signals. The mathematical properties of the cold diffusion sampling process can be used to restore high-quality samples from arbitrary degradations. Building on these properties, we propose an improved training algorithm and objective that help the model generalize better during sampling. We verify the proposed framework with two model architectures. Experimental results on the VoiceBank-DEMAND speech enhancement benchmark demonstrate strong performance of the proposed approach compared with representative discriminative models and diffusion-based enhancement models.
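
To make the sampling idea concrete, the following is a minimal sketch of a cold-diffusion-style sampling loop applied to a noisy signal. It is an illustration under stated assumptions, not the implementation evaluated in this work: the linear interpolation used in `degrade`, the step count `T`, and the `oracle` restoration function (which simply returns the known clean signal in place of a trained network) are all hypothetical choices made so the example is self-contained.

```python
import numpy as np

def degrade(x0, y, t, T):
    """Hypothetical degradation D(x0, t): a linear mix of a clean estimate x0
    and the noisy observation y. t = 0 gives x0, t = T gives y."""
    a = t / T
    return (1.0 - a) * x0 + a * y

def cold_sample(y, restore, T):
    """Cold-diffusion-style sampling: start from the fully degraded signal and,
    at each step, swap the degradation at level s for the one at level s - 1."""
    x = y.copy()  # x_T is the noisy input itself
    for s in range(T, 0, -1):
        x0_hat = restore(x, s)  # current estimate of the clean signal
        x = x - degrade(x0_hat, y, s, T) + degrade(x0_hat, y, s - 1, T)
    return x

# Toy data: a sine wave standing in for clean speech (assumption for the demo).
rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0.0, 8.0 * np.pi, 256))
noisy = clean + 0.3 * rng.standard_normal(256)

# Oracle restoration: a stand-in for a trained network, used only to check
# that the loop recovers the clean signal when the estimate is perfect.
oracle = lambda x, s: clean

enhanced = cold_sample(noisy, oracle, T=10)
```

With a perfect restoration function, each step lands exactly on the degradation at the next lower level, so the loop ends at the clean signal. In practice `restore` would be a learned model whose estimates are imperfect, which is where the training algorithm and objective matter.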
