GlowVC: Mel-spectrogram space disentangling model for language-independent text-free voice conversion
R. Barra-Chicote | Grzegorz Beringer | Thomas Merritt | Abdelhamid Ezzerg | Magdalena Proszewska | Daniel Sáez-Trigueros