Cross-Lingual Voice Conversion With Controllable Speaker Individuality Using Variational Autoencoder and Star Generative Adversarial Network

This paper proposes a non-parallel cross-lingual voice conversion (CLVC) model that can mimic a target voice while continuously controlling speaker individuality, on the basis of the variational autoencoder (VAE) and star generative adversarial network (StarGAN). Most studies on CLVC have focused only on mimicking a particular speaker's voice, without the ability to arbitrarily modify speaker individuality. In practice, the ability to generate new speaker individuality may be more useful than merely mimicking an existing voice. The proposed model therefore reliably extracts speaker embeddings across different languages using a VAE. An F0-injection method is also introduced to enhance F0 modeling in the cross-lingual setting. To avoid the over-smoothing degradation of the conventional VAE, the adversarial training scheme of StarGAN is adopted to improve the VAE's training objective in the CLVC task. Objective and subjective measurements confirm the effectiveness of the proposed model and the F0-injection method. Furthermore, a speaker-similarity measurement on fictitious voices reveals a strong linear relationship between speaker individuality and the interpolated speaker embedding, which indicates that speaker individuality can be controlled with the proposed model.
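The controllable-individuality result rests on linearly interpolating between the speaker embeddings that the VAE extracts: a mixing weight slides the generated voice between two real speakers, producing fictitious voices in between. The sketch below illustrates only this interpolation step; the function name, the 16-dimensional embedding size, and the NumPy representation are assumptions for illustration, not details from the paper.

```python
import numpy as np

def interpolate_speaker_embedding(emb_a, emb_b, lam):
    """Linearly interpolate between two speaker embeddings.

    lam = 0.0 reproduces speaker A, lam = 1.0 reproduces speaker B;
    intermediate values define fictitious voices along the line
    between the two embeddings. The result would be fed to the
    decoder in place of a real speaker's embedding.
    """
    emb_a = np.asarray(emb_a, dtype=np.float64)
    emb_b = np.asarray(emb_b, dtype=np.float64)
    assert emb_a.shape == emb_b.shape, "embeddings must share a shape"
    return (1.0 - lam) * emb_a + lam * emb_b

# Toy example with a placeholder 16-dimensional embedding space.
z_a = np.zeros(16)   # stand-in for speaker A's embedding
z_b = np.ones(16)    # stand-in for speaker B's embedding
z_mid = interpolate_speaker_embedding(z_a, z_b, 0.5)
```

The paper's observation that perceived speaker similarity varies roughly linearly with the interpolation weight is what makes this simple convex combination a usable control knob.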
