S3PRL-VC: Open-Source Voice Conversion Framework with Self-Supervised Speech Representations

This paper introduces S3PRL-VC, an open-source voice conversion (VC) framework built on the S3PRL toolkit. In the context of recognition-synthesis VC, self-supervised speech representations (S3Rs) are attractive for their potential to replace the expensive supervised representations adopted by state-of-the-art VC systems. Moreover, we argue that VC is a good probing task for S3R analysis. In this work, we provide a series of in-depth analyses by benchmarking on the two tasks of the Voice Conversion Challenge 2020 (VCC2020), namely intra-lingual and cross-lingual any-to-one (A2O) VC, as well as an any-to-any (A2A) setting. We compare not only different S3Rs with one another but also S3R-based systems with the top VCC2020 systems, which use supervised representations. Systematic objective and subjective evaluations show that S3R-based systems are comparable to the top VCC2020 systems in the A2O setting in terms of similarity, and achieve state-of-the-art performance among S3R-based A2A VC systems. We believe the extensive analyses, as well as the toolkit itself, benefit not only the S3R community but also the VC community. The codebase is open-sourced.
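
To make the recognition half of the recognition-synthesis pipeline concrete, the following is a minimal sketch of extracting S3R features through the standard S3PRL upstream interface. The choice of wav2vec 2.0, the random input waveforms, and the final conversion comment are illustrative assumptions for exposition, not the framework's exact code.

```python
import torch
import s3prl.hub as hub

# Load a self-supervised upstream model (wav2vec 2.0 here, as an example)
# from the s3prl hub; any registered upstream name would work the same way.
upstream = getattr(hub, "wav2vec2")()
upstream.eval()

# s3prl upstreams take a list of raw 16 kHz waveforms of variable length.
# Random tensors stand in for real utterances in this sketch.
wavs = [torch.randn(16000 * 3), torch.randn(16000 * 2)]

with torch.no_grad():
    # The standard s3prl interface returns layer-wise hidden states.
    hidden_states = upstream(wavs)["hidden_states"]

# In recognition-synthesis VC, frame-level S3R features like these replace
# supervised features (e.g., phonetic posteriorgrams) as the linguistic
# representation; a downstream conversion model (not shown) then maps them
# to acoustic features for a neural vocoder.
features = hidden_states[-1]  # last layer: (batch, frames, feature_dim)
print(features.shape)
```

Because every upstream exposes this same interface, swapping one S3R for another requires changing only the hub name, which is what makes VC a convenient probing task for comparing representations.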
