VoiceGrad: Non-Parallel Any-to-Many Voice Conversion With Annealed Langevin Dynamics

In this paper, we propose VoiceGrad, a non-parallel any-to-many voice conversion (VC) method. Inspired by WaveGrad, a recently introduced waveform generation method, VoiceGrad builds on the concepts of score matching and Langevin dynamics. It trains a score approximator with weighted denoising score matching; the approximator is a fully convolutional network with a U-Net structure that predicts the gradient of the log density of the speech feature sequences of multiple speakers. VC is then performed with annealed Langevin dynamics, which iteratively updates an input feature sequence towards the nearest stationary point of the target distribution using the trained score approximator. Because of this formulation, VoiceGrad enables any-to-many VC, a scenario in which the speaker of the input speech can be arbitrary, and allows non-parallel training, which requires no parallel utterances or transcriptions.
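The annealed Langevin update described above can be illustrated with a minimal sketch. This is not the VoiceGrad implementation: the trained U-Net score approximator is replaced by the exact score of a 1-D Gaussian "target" (the hypothetical `toy_score`), and the step-size rule `alpha = eps * sigma**2 / sigma_min**2` and the update form follow the common formulation of Song and Ermon, which is assumed here rather than taken from the abstract. The names `annealed_langevin`, `eps`, and `steps_per_level` are likewise illustrative.

```python
import numpy as np

def toy_score(x, sigma, mu=2.0, s=0.5):
    """Stand-in for a trained score network (assumption, not VoiceGrad's model).

    Exact score (gradient of the log density) of N(mu, s^2) after it has
    been perturbed with Gaussian noise of standard deviation sigma.
    """
    return -(x - mu) / (s**2 + sigma**2)

def annealed_langevin(x0, score_fn, sigmas, eps=5e-3, steps_per_level=100, seed=0):
    """Move x0 towards the target distribution over decreasing noise levels."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    sigma_min = sigmas[-1]
    for sigma in sigmas:  # anneal from the largest noise level to the smallest
        alpha = eps * sigma**2 / sigma_min**2  # step size shrinks with sigma
        for _ in range(steps_per_level):
            z = rng.standard_normal(x.shape)
            # Langevin step: follow the score, then inject Gaussian noise.
            x = x + 0.5 * alpha * score_fn(x, sigma) + np.sqrt(alpha) * z
    return x

# Geometric noise schedule, largest first (values chosen for this toy only).
sigmas = np.geomspace(1.0, 0.1, 10)

# "Arbitrary input": 2000 independent points that start far from the target.
x_in = np.full(2000, -3.0)
x_out = annealed_langevin(x_in, toy_score, sigmas)
print(x_out.mean())
```

In this toy setting the inputs start at -3 and end up scattered around the target mean of 2. In VoiceGrad the same kind of loop would operate on speech feature sequences, with the score approximator additionally conditioned on the target speaker.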
