Voice Conversion towards Arbitrary Speakers With Limited Data

Voice conversion towards a specific speaker typically requires a large number of utterances from that target speaker, which is expensive to collect in practice. This paper proposes a speaker-adaptive voice conversion (SAVC) system that accomplishes voice conversion towards arbitrary speakers with limited data. First, a multi-speaker voice conversion (MSVC) model is trained to learn the information shared across speakers and to build a speaker latent space. Second, utterances of a new target speaker are used to fine-tune the MSVC model so that it learns the target speaker's voice. In both steps, the model takes as input phonetic posteriorgrams (PPGs), a speaker-independent linguistic feature, together with speaker embeddings such as i-vectors or x-vectors. To achieve better results, two adaptation approaches are explored: fine-tuning the whole MSVC model, or adapting only additional linear hidden layers (AHL) inserted into it. Results show that both adaptation approaches significantly outperform the unadapted MSVC model. Moreover, whole-model adaptation based on x-vectors achieves higher similarity to the target speaker with as few as 10 target utterances.
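The two adaptation strategies can be sketched in PyTorch as follows. This is a minimal illustrative sketch, not the authors' implementation: the layer sizes, the MSVC architecture, and the placement of the inserted linear layers are assumptions. Only the overall scheme follows the abstract: PPG frames are concatenated with a fixed speaker embedding (i-vector or x-vector), and adaptation either fine-tunes all parameters or trains only inserted hidden linear layers while the pretrained model stays frozen.

```python
import torch
import torch.nn as nn

class MSVC(nn.Module):
    """Multi-speaker VC model (illustrative sizes): maps PPG frames,
    conditioned on a speaker embedding, to acoustic features."""
    def __init__(self, ppg_dim=144, spk_dim=512, feat_dim=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ppg_dim + spk_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, ppg, spk_emb):
        # ppg: (batch, frames, ppg_dim); spk_emb: (batch, spk_dim)
        spk = spk_emb.unsqueeze(1).expand(-1, ppg.size(1), -1)
        return self.net(torch.cat([ppg, spk], dim=-1))

def adapt_whole(model, lr=1e-4):
    """Strategy 1: fine-tune every parameter of the pretrained MSVC model
    on the target speaker's utterances."""
    return torch.optim.Adam(model.parameters(), lr=lr)

class AHL(nn.Module):
    """Strategy 2 (AHL): insert extra linear layers around the frozen MSVC
    model and train only those layers. Placement (input/output side here)
    is an assumption; identity initialisation keeps the starting behaviour
    equal to the unadapted model."""
    def __init__(self, msvc, ppg_dim=144, feat_dim=80):
        super().__init__()
        self.msvc = msvc
        for p in self.msvc.parameters():  # freeze the pretrained model
            p.requires_grad = False
        self.pre = nn.Linear(ppg_dim, ppg_dim)
        self.post = nn.Linear(feat_dim, feat_dim)
        nn.init.eye_(self.pre.weight); nn.init.zeros_(self.pre.bias)
        nn.init.eye_(self.post.weight); nn.init.zeros_(self.post.bias)

    def forward(self, ppg, spk_emb):
        return self.post(self.msvc(self.pre(ppg), spk_emb))
```

In either case, adaptation would run a regression loss (e.g. L1 or L2) between predicted and target acoustic features over the few available target utterances, reusing the same PPG extractor and speaker-embedding model as in MSVC pretraining.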
