Unsupervised Speech Enhancement Using Dynamical Variational Autoencoders

Dynamical variational autoencoders (DVAEs) are a class of deep generative models with latent variables, dedicated to modeling time series of high-dimensional data. DVAEs can be considered extensions of the variational autoencoder (VAE) that include temporal dependencies between successive observed and/or latent vectors. Previous work has shown the benefit of using DVAEs over the VAE for speech spectrogram modeling. Independently, the VAE has been successfully applied to speech enhancement in noise, in an unsupervised noise-agnostic set-up that requires neither noise samples nor noisy speech samples at training time, but only clean speech signals. In this paper, we extend these works to DVAE-based single-channel unsupervised speech enhancement, hence exploiting both the unsupervised representation learning and the dynamics modeling of speech signals. We propose an unsupervised speech enhancement algorithm that combines a DVAE speech prior pre-trained on clean speech signals with a noise model based on nonnegative matrix factorization, and we derive a variational expectation-maximization (VEM) algorithm to perform speech enhancement. The algorithm is presented with the most general DVAE formulation and is then applied with three specific DVAE models to illustrate the versatility of the framework. Experimental results show that the proposed DVAE-based approach outperforms its VAE-based counterpart, as well as several supervised and unsupervised noise-dependent baselines, especially when the noise type is unseen during training.
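To make the overall structure of such an approach concrete, below is a minimal NumPy sketch (not the paper's exact derivation) of a VEM-style enhancement loop: a speech variance is queried from a speech prior, the noise variance is modeled as a nonnegative matrix factorization refit with Itakura-Saito multiplicative updates, and the speech estimate is obtained by Wiener filtering. The callable `speech_var_fn` is a hypothetical placeholder standing in for the pre-trained DVAE speech prior (which in the actual method involves encoder/decoder networks and sampling of the latent variables); the variable names and the simple alternating update scheme shown here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a VEM-style speech enhancement loop with an NMF noise model.
# Assumptions: speech_var_fn is a stand-in for a pre-trained DVAE speech prior;
# the alternating "E-step / M-step" surrogate below is illustrative only.
import numpy as np

def nmf_is_updates(V, W, H, eps=1e-10):
    """One multiplicative update of (W, H) decreasing the Itakura-Saito
    divergence between V (observed power) and the model W @ H."""
    WH = W @ H + eps
    W *= ((V / WH**2) @ H.T) / ((1.0 / WH) @ H.T + eps)
    WH = W @ H + eps
    H *= (W.T @ (V / WH**2)) / (W.T @ (1.0 / WH) + eps)
    return W, H

def vem_enhance(X, speech_var_fn, n_nmf=8, n_iter=50, eps=1e-10):
    """X: complex STFT of the noisy mixture, shape (freq, time).
    speech_var_fn: maps a power spectrogram to a speech variance estimate
    (placeholder for the pre-trained DVAE speech prior)."""
    F, T = X.shape
    V = np.abs(X) ** 2                      # noisy power spectrogram
    rng = np.random.default_rng(0)
    W = rng.random((F, n_nmf)) + eps        # NMF noise spectral dictionary
    H = rng.random((n_nmf, T)) + eps        # NMF noise activations
    s_var = V.copy()                        # initial speech variance estimate
    for _ in range(n_iter):
        n_var = W @ H                       # current noise variance
        # "E-step" surrogate: query the speech prior on the current speech power
        s_var = speech_var_fn(np.maximum(V - n_var, eps))
        # "M-step": refit the NMF noise model to the residual power
        W, H = nmf_is_updates(np.maximum(V - s_var, eps), W, H)
    # Wiener filter built from the final speech and noise variances
    gain = s_var / (s_var + W @ H + eps)
    return gain * X

if __name__ == "__main__":
    # Toy usage with random data and an identity placeholder for the prior.
    rng = np.random.default_rng(1)
    X = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
    S_hat = vem_enhance(X, speech_var_fn=lambda p: p)
    print(S_hat.shape)  # (257, 100)
```

In the actual framework, the speech prior query and the noise parameter updates are both derived from a variational lower bound on the likelihood of the noisy observations; the sketch above only conveys the alternation between the speech prior and the NMF noise model.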
