End-To-End Accent Conversion Without Using Native Utterances

Accent conversion (AC) techniques aim to convert non-native-accented speech into native-accented speech. Conventional AC methods convert only the speaker identity of a native speaker's voice to that of the non-native-accented target speaker, leaving the underlying content and pronunciations unchanged. This hinders their use in real-world applications, because native-accented utterances are required at the conversion stage. In this paper, we present an end-to-end framework that performs AC on non-native-accented utterances without using any native-accented utterances during online conversion. We achieve this by independently extracting linguistic and speaker representations from non-native-accented speech and conditioning a speech synthesis model on these representations to generate native-accented speech. Experiments on open-source corpora show that the proposed system converts Hindi-accented English speech into native American English speech with high naturalness, producing output that is indistinguishable from native-accented recordings in terms of accent.
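
To make the described pipeline concrete, below is a minimal sketch of the extract-and-condition idea: a content encoder produces speaker-independent linguistic features (phonetic-posteriorgram-like), a speaker encoder produces an utterance-level identity embedding, and a synthesizer conditioned on both predicts native-accented mel-spectrogram frames. This is an illustrative assumption of one plausible realization, not the paper's exact architecture; all module names, layer sizes, and feature dimensions here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinguisticEncoder(nn.Module):
    """Hypothetical content encoder: maps mel frames to speaker-independent
    linguistic features (a phonetic-posteriorgram-like representation)."""
    def __init__(self, n_mels=80, hidden=256, n_phones=72):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_phones)

    def forward(self, mels):                     # mels: (B, T, n_mels)
        h, _ = self.rnn(mels)
        return self.proj(h).softmax(dim=-1)      # (B, T, n_phones)

class SpeakerEncoder(nn.Module):
    """Hypothetical utterance-level speaker encoder (d-vector style)."""
    def __init__(self, n_mels=80, hidden=256, dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, dim)

    def forward(self, mels):
        _, h = self.rnn(mels)                    # h: (layers, B, hidden)
        return F.normalize(self.proj(h[-1]), dim=-1)  # (B, dim)

class Synthesizer(nn.Module):
    """Hypothetical synthesizer: conditions on the linguistic features plus
    the speaker embedding and predicts native-accented mel frames."""
    def __init__(self, n_phones=72, spk_dim=128, hidden=256, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(n_phones + spk_dim, hidden,
                          num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, ppg, spk):                 # ppg: (B,T,P), spk: (B,D)
        spk_seq = spk.unsqueeze(1).expand(-1, ppg.size(1), -1)
        h, _ = self.rnn(torch.cat([ppg, spk_seq], dim=-1))
        return self.out(h)                       # (B, T, n_mels)

# Conversion stage: only the non-native utterance is needed as input.
mels = torch.randn(1, 200, 80)                   # non-native input features
ppg = LinguisticEncoder()(mels)                  # accent-free content
spk = SpeakerEncoder()(mels)                     # speaker identity
native_mels = Synthesizer()(ppg, spk)            # native-accented mels
# A neural vocoder would then convert native_mels into a waveform.
```

The key property the sketch illustrates is that both representations are extracted from the same non-native utterance, so no native-accented reference is needed at conversion time; only training the synthesizer requires native-accented speech.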
