GlowVC: Mel-spectrogram space disentangling model for language-independent text-free voice conversion

In this paper, we propose GlowVC: a multilingual multi-speaker flow-based model for language-independent text-free voice conversion. We build on Glow-TTS, whose architecture enables the use of linguistic features during training without requiring them at VC inference time. We consider two versions of our model: GlowVC-conditional and GlowVC-explicit. GlowVC-conditional models the distribution of mel-spectrograms with a speaker-conditioned flow and disentangles the mel-spectrogram space into content- and pitch-relevant dimensions, while GlowVC-explicit models the explicit distribution with an unconditioned flow and disentangles that space into content-, pitch- and speaker-relevant dimensions. We evaluate our models in terms of intelligibility, speaker similarity and naturalness for intra- and cross-lingual conversion in seen and unseen languages. GlowVC models greatly outperform the AutoVC baseline in terms of intelligibility, while matching its speaker similarity in intra-lingual VC and falling slightly short in the cross-lingual setting. Moreover, we demonstrate that GlowVC-explicit surpasses both GlowVC-conditional and AutoVC in terms of naturalness.
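The core idea in GlowVC-explicit, disentangling an invertible latent space into content-, pitch- and speaker-relevant dimensions and converting a voice by replacing only the speaker-relevant part, can be illustrated with a minimal toy sketch. This is not the paper's implementation: the real model uses a deep Glow-style flow over mel-spectrograms, whereas here a single orthogonal linear map stands in for the bijective flow, and the dimension partition is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 80  # mel-spectrogram frame dimensionality (typical mel-bin count)

# Hypothetical partition of the latent space into disentangled groups
# (the sizes here are illustrative, not taken from the paper):
CONTENT, PITCH, SPEAKER = slice(0, 64), slice(64, 72), slice(72, 80)

# An orthogonal matrix is exactly invertible, mimicking a flow's bijectivity.
W, _ = np.linalg.qr(rng.standard_normal((D, D)))

def flow(mel):       # mel frame -> latent z (forward pass of the "flow")
    return W @ mel

def flow_inv(z):     # latent z -> mel frame (inverse pass; W is orthogonal)
    return W.T @ z

def convert(src_mel, tgt_mel):
    """Swap the speaker-relevant latent dims of a source frame for the target's,
    keeping the content- and pitch-relevant dims of the source."""
    z_src, z_tgt = flow(src_mel), flow(tgt_mel)
    z_src[SPEAKER] = z_tgt[SPEAKER]
    return flow_inv(z_src)

src = rng.standard_normal(D)  # stand-in source-speaker mel frame
tgt = rng.standard_normal(D)  # stand-in target-speaker mel frame
out = convert(src, tgt)
```

Because the map is bijective, the converted frame retains the source's content/pitch coordinates exactly while carrying the target's speaker coordinates; the trained model achieves an analogous swap through its learned flow and prior.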
