Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder

Voice conversion is a challenging task that transforms the voice characteristics of a source speaker into those of a target speaker without changing the linguistic content. Recently, many works on many-to-many voice conversion (VC) based on variational autoencoders (VAEs) have achieved good results; however, these methods cannot adequately disentangle speaker identity from linguistic content, which limits their performance in unseen-speaker scenarios. In this paper, we propose a new method based on feature disentanglement to tackle many-to-many voice conversion. The method disentangles speaker identity and linguistic content from utterances, and it can convert from many source speakers to many target speakers with a single autoencoder network. Moreover, it naturally handles unseen target speakers. We perform both objective and subjective evaluations to show the competitive performance of our proposed method against other state-of-the-art models in terms of naturalness and target-speaker similarity.
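To make the general idea concrete, the sketch below shows one common way such a disentangling autoencoder can be structured: a content encoder produces a frame-level VAE latent, a speaker encoder produces a time-pooled utterance-level embedding, and the decoder reconstructs from their concatenation so that conversion amounts to swapping in a target speaker's embedding. This is a minimal PyTorch illustration of that pattern under assumed dimensions and module names, not the paper's actual architecture or training recipe.

```python
# Minimal sketch (illustrative, not the paper's network): a single autoencoder
# whose encoder is split into a content branch and a speaker branch.
import torch
import torch.nn as nn

class DisentangledVAE(nn.Module):
    def __init__(self, n_mels=80, content_dim=32, speaker_dim=16):
        super().__init__()
        # Content encoder: frame-wise, outputs mean/log-variance of the content latent.
        self.content_enc = nn.Sequential(nn.Linear(n_mels, 128), nn.ReLU())
        self.content_mu = nn.Linear(128, content_dim)
        self.content_logvar = nn.Linear(128, content_dim)
        # Speaker encoder: pooled over time, one embedding per utterance.
        self.speaker_enc = nn.Sequential(nn.Linear(n_mels, 128), nn.ReLU(),
                                         nn.Linear(128, speaker_dim))
        # Decoder reconstructs mel frames from [content latent ; speaker embedding].
        self.dec = nn.Sequential(nn.Linear(content_dim + speaker_dim, 128),
                                 nn.ReLU(), nn.Linear(128, n_mels))

    def forward(self, mel):                       # mel: (batch, time, n_mels)
        h = self.content_enc(mel)
        mu, logvar = self.content_mu(h), self.content_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterize
        spk = self.speaker_enc(mel).mean(dim=1, keepdim=True)     # utterance-level
        spk = spk.expand(-1, mel.size(1), -1)                     # broadcast over time
        recon = self.dec(torch.cat([z, spk], dim=-1))
        return recon, mu, logvar

def vae_loss(recon, mel, mu, logvar):
    # Standard VAE objective: reconstruction plus KL divergence (Kingma & Welling).
    rec = nn.functional.mse_loss(recon, mel)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```

At conversion time, under this setup, one would encode the source utterance with the content branch and a target utterance with the speaker branch, then decode from the swapped pair; because the speaker embedding is computed from any utterance rather than looked up from a trained speaker table, unseen target speakers are handled naturally.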
