End-To-End Accent Conversion Without Using Native Utterances

Accent conversion (AC) techniques aim to convert non-native-accented speech into native-accented speech. Conventional AC methods convert only the speaker identity of a native speaker's voice to that of the non-native-accented target speaker, leaving the underlying content and pronunciations unchanged. This hinders their use in real-world applications, because native-accented utterances are required at the conversion stage. In this paper, we present an end-to-end framework that performs AC on non-native-accented utterances without using any native-accented utterances during online conversion. We achieve this by independently extracting linguistic and speaker representations from non-native-accented speech and conditioning a speech synthesis model on these representations to generate native-accented speech. Experiments on open-source corpora show that the proposed system converts Hindi-accented English speech into native American English speech with high naturalness, producing output that is indistinguishable from native-accented recordings in terms of accent.
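
To make the described pipeline concrete, below is a minimal sketch of the extract-and-condition idea: a content encoder produces speaker-independent linguistic features (phonetic-posteriorgram-like), a speaker encoder produces an utterance-level identity embedding, and a synthesizer conditioned on both predicts native-accented mel-spectrogram frames. This is an illustrative assumption of one plausible realization, not the paper's exact architecture; all module names, layer sizes, and feature dimensions here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinguisticEncoder(nn.Module):
    """Hypothetical content encoder: maps mel frames to speaker-independent
    linguistic features (a phonetic-posteriorgram-like representation)."""
    def __init__(self, n_mels=80, hidden=256, n_phones=72):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_phones)

    def forward(self, mels):                     # mels: (B, T, n_mels)
        h, _ = self.rnn(mels)
        return self.proj(h).softmax(dim=-1)      # (B, T, n_phones)

class SpeakerEncoder(nn.Module):
    """Hypothetical utterance-level speaker encoder (d-vector style)."""
    def __init__(self, n_mels=80, hidden=256, dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, dim)

    def forward(self, mels):
        _, h = self.rnn(mels)                    # h: (layers, B, hidden)
        return F.normalize(self.proj(h[-1]), dim=-1)  # (B, dim)

class Synthesizer(nn.Module):
    """Hypothetical synthesizer: conditions on the linguistic features plus
    the speaker embedding and predicts native-accented mel frames."""
    def __init__(self, n_phones=72, spk_dim=128, hidden=256, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(n_phones + spk_dim, hidden,
                          num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, ppg, spk):                 # ppg: (B,T,P), spk: (B,D)
        spk_seq = spk.unsqueeze(1).expand(-1, ppg.size(1), -1)
        h, _ = self.rnn(torch.cat([ppg, spk_seq], dim=-1))
        return self.out(h)                       # (B, T, n_mels)

# Conversion stage: only the non-native utterance is needed as input.
mels = torch.randn(1, 200, 80)                   # non-native input features
ppg = LinguisticEncoder()(mels)                  # accent-free content
spk = SpeakerEncoder()(mels)                     # speaker identity
native_mels = Synthesizer()(ppg, spk)            # native-accented mels
# A neural vocoder would then convert native_mels into a waveform.
```

The key property the sketch illustrates is that both representations are extracted from the same non-native utterance, so no native-accented reference is needed at conversion time; only training the synthesizer requires native-accented speech.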
