Paper information - A Benchmark of Dynamical Variational Autoencoders applied to Speech Spectrogram Modeling

The Variational Autoencoder (VAE) is a powerful deep generative model that is now extensively used to represent high-dimensional complex data via a low-dimensional latent space learned in an unsupervised manner. In the original VAE model, input data vectors are processed independently. In recent years, a series of papers have presented different extensions of the VAE to process sequential data; these extensions model not only the latent space but also the temporal dependencies within a sequence of data vectors and corresponding latent vectors, relying on recurrent neural networks. We recently performed a comprehensive review of those models and unified them into a general class called Dynamical Variational Autoencoders (DVAEs). In this paper, we present the results of an experimental benchmark comparing six of those DVAE models on the speech analysis-resynthesis task, as an illustration of the high potential of DVAEs for speech modeling.
