One-Shot Voice Conversion by Vector Quantization

In this paper, we propose a vector quantization (VQ) based one-shot voice conversion (VC) approach that requires no supervision from speaker labels. We model the content embedding as a sequence of discrete codes and take the difference between the encoder output before quantization and the quantized vector as the speaker embedding. We show that this approach strongly disentangles content and speaker information using only a reconstruction loss, thereby achieving one-shot VC.
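A minimal sketch of this idea in PyTorch (not the authors' implementation; the codebook size, embedding dimension, and time-averaging of the residual are illustrative assumptions):

```python
import torch
import torch.nn as nn

class VQDisentangler(nn.Module):
    """Sketch only: content = nearest codebook vectors (discrete codes),
    speaker = residual between pre- and post-quantization vectors."""
    def __init__(self, dim=64, codebook_size=128):
        super().__init__()
        # Learnable codebook; sizes are illustrative assumptions.
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim))

    def forward(self, z):
        # z: (batch, time, dim) continuous encoder output.
        # Squared distances to every codeword: (batch, time, codebook_size).
        dist = (z.unsqueeze(2) - self.codebook.view(1, 1, -1, z.size(-1))).pow(2).sum(-1)
        idx = dist.argmin(dim=-1)          # discrete content codes
        content = self.codebook[idx]       # quantized (content) vectors
        # Speaker embedding as the quantization residual; averaging over time
        # is an assumption made here to remove content-specific variation.
        speaker = (z - content).mean(dim=1)
        return content, speaker
```

Training with only a reconstruction loss would then decode from `content` plus `speaker` and compare against the input spectrogram, so no speaker labels are needed.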
