Universal Speech Enhancement with Score-based Diffusion

Removing background noise from speech audio has been the subject of considerable effort, especially in recent years due to the rise of virtual communication and amateur recordings. Yet background noise is not the only unpleasant disturbance that can prevent intelligibility: reverb, clipping, codec artifacts, problematic equalization, limited bandwidth, or inconsistent loudness are equally disturbing and ubiquitous. In this work, we propose to consider the task of speech enhancement as a holistic endeavor, and present a universal speech enhancement system that tackles 55 different distortions at the same time. Our approach consists of a generative model that employs score-based diffusion, together with a multi-resolution conditioning network that performs enhancement with mixture density networks. We show that this approach significantly outperforms the state of the art in a subjective test performed by expert listeners. We also show that it achieves competitive objective scores with just 4-8 diffusion steps, despite not considering any particular strategy for fast sampling. We hope that both our methodology and technical contributions encourage researchers and practitioners to adopt a universal approach to speech enhancement, possibly framing it as a generative task.
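As an illustration of the mixture density idea mentioned above, the following is a minimal sketch of the negative log-likelihood of a one-dimensional Gaussian mixture, the kind of objective a mixture density network is typically trained with. It is not the authors' implementation: the abstract does not specify the mixture parameterization, so the function name `mdn_nll`, the array shapes, and the choice of Gaussian components are all assumptions made for this example.

```python
import numpy as np


def _logsumexp(a, axis=-1):
    """Numerically stable log(sum(exp(a))) along `axis`."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def mdn_nll(target, log_weights, means, log_stds):
    """Mean negative log-likelihood of `target` under a 1-D Gaussian mixture.

    Hypothetical shapes: target is (N,); log_weights, means and log_stds
    are (N, K), i.e. one K-component mixture per target value.
    """
    # Normalize the unnormalized mixture logits in log space (log-softmax).
    log_w = log_weights - _logsumexp(log_weights)[..., None]
    # Log-density of each Gaussian component evaluated at the target.
    z = (target[..., None] - means) / np.exp(log_stds)
    log_comp = -0.5 * z**2 - log_stds - 0.5 * np.log(2.0 * np.pi)
    # Mixture log-likelihood, averaged over the N targets and negated.
    return float(-np.mean(_logsumexp(log_w + log_comp)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, k = 8, 3
    loss = mdn_nll(
        rng.normal(size=n),
        rng.normal(size=(n, k)),
        rng.normal(size=(n, k)),
        0.1 * rng.normal(size=(n, k)),
    )
    print(f"mixture NLL on random inputs: {loss:.4f}")
```

In a setting like the one described, a conditioning network would output the three parameter arrays and the loss would be minimized with respect to the network weights; predicting log-weights and log-standard-deviations keeps the mixture valid without explicit constraints.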
