Voice Conversion towards Arbitrary Speakers With Limited Data

Voice conversion towards a specific speaker typically requires a large number of utterances from that target speaker, which is expensive to collect in practice. This paper proposes a speaker-adaptive voice conversion (SAVC) system that accomplishes voice conversion towards arbitrary speakers with limited data. First, a multi-speaker voice conversion (MSVC) model is trained to learn the information shared across speakers and to build a speaker latent space. Second, utterances of a new target speaker are used to fine-tune the MSVC model so that it learns the target speaker's voice. In both steps, the model takes as input phonetic posteriorgrams (PPGs), a speaker-independent linguistic feature, together with speaker embeddings such as i-vectors or x-vectors. To achieve better results, two adaptation approaches are explored: fine-tuning the whole MSVC model, or adapting only additional linear hidden layers (AHL) inserted into it. Results show that both adaptation approaches significantly outperform the unadapted MSVC model. Moreover, whole-model adaptation based on x-vectors achieves higher similarity to the target speaker with as few as 10 target utterances.
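The two adaptation strategies can be sketched in PyTorch as follows. This is a minimal illustrative sketch, not the authors' implementation: the layer sizes, the MSVC architecture, and the placement of the inserted linear layers are assumptions. Only the overall scheme follows the abstract: PPG frames are concatenated with a fixed speaker embedding (i-vector or x-vector), and adaptation either fine-tunes all parameters or trains only inserted hidden linear layers while the pretrained model stays frozen.

```python
import torch
import torch.nn as nn

class MSVC(nn.Module):
    """Multi-speaker VC model (illustrative sizes): maps PPG frames,
    conditioned on a speaker embedding, to acoustic features."""
    def __init__(self, ppg_dim=144, spk_dim=512, feat_dim=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ppg_dim + spk_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, ppg, spk_emb):
        # ppg: (batch, frames, ppg_dim); spk_emb: (batch, spk_dim)
        spk = spk_emb.unsqueeze(1).expand(-1, ppg.size(1), -1)
        return self.net(torch.cat([ppg, spk], dim=-1))

def adapt_whole(model, lr=1e-4):
    """Strategy 1: fine-tune every parameter of the pretrained MSVC model
    on the target speaker's utterances."""
    return torch.optim.Adam(model.parameters(), lr=lr)

class AHL(nn.Module):
    """Strategy 2 (AHL): insert extra linear layers around the frozen MSVC
    model and train only those layers. Placement (input/output side here)
    is an assumption; identity initialisation keeps the starting behaviour
    equal to the unadapted model."""
    def __init__(self, msvc, ppg_dim=144, feat_dim=80):
        super().__init__()
        self.msvc = msvc
        for p in self.msvc.parameters():  # freeze the pretrained model
            p.requires_grad = False
        self.pre = nn.Linear(ppg_dim, ppg_dim)
        self.post = nn.Linear(feat_dim, feat_dim)
        nn.init.eye_(self.pre.weight); nn.init.zeros_(self.pre.bias)
        nn.init.eye_(self.post.weight); nn.init.zeros_(self.post.bias)

    def forward(self, ppg, spk_emb):
        return self.post(self.msvc(self.pre(ppg), spk_emb))
```

In either case, adaptation would run a regression loss (e.g. L1 or L2) between predicted and target acoustic features over the few available target utterances, reusing the same PPG extractor and speaker-embedding model as in MSVC pretraining.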
