Paper information - Explicit Intensity Control for Accented Text-to-speech

Explicit Intensity Control for Accented Text-to-speech

Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a variant of the standard version (L1). How to control accent intensity during TTS is an interesting research direction that has attracted growing attention. Recent work designs a speaker-adversarial loss to disentangle speaker and accent information, and then adjusts the loss weight to control the accent intensity. However, such a control method lacks interpretability, and there is no direct correlation between the controlling factor and natural accent intensity. To this end, this paper proposes a new intuitive and explicit accent intensity control scheme for accented TTS. Specifically, we first extract the posterior probability, called "goodness of pronunciation" (GoP), from the L1 speech recognition model to quantify the phoneme accent intensity for accented speech, and then design a FastSpeech2-based TTS model, named Ai-TTS, to take the accent intensity expression into account during speech generation. Experiments show that our method outperforms the baseline model in terms of accent rendering and intensity control.
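The abstract does not state the exact GoP formula, so the following is only a minimal sketch under a stated assumption: GoP for a phoneme is taken as the average log posterior of the canonical phoneme over its aligned frames, a commonly used definition. The function name `gop_scores`, the posterior matrix layout, and the alignment tuple format are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

# Assumed input formats (hypothetical, for illustration only):
#   posteriors: one row per frame, one column per phoneme class,
#               as produced by some L1-trained acoustic model.
#   alignment:  (phoneme_index, start_frame, end_frame) per phoneme,
#               with end_frame exclusive.
Alignment = Sequence[Tuple[int, int, int]]


def gop_scores(posteriors: Sequence[Sequence[float]],
               alignment: Alignment,
               floor: float = 1e-10) -> List[float]:
    """Return one GoP value per aligned phoneme.

    GoP is computed here as the mean log posterior of the canonical
    phoneme over its frames (an assumed, commonly used definition).
    """
    scores = []
    for phone, start, end in alignment:
        if end <= start:
            raise ValueError("each segment must span at least one frame")
        # Floor the posterior to avoid log(0) on degenerate frames.
        log_sum = sum(math.log(max(frame[phone], floor))
                      for frame in posteriors[start:end])
        scores.append(log_sum / (end - start))
    return scores


# Toy example: two phoneme classes, four frames, and the same canonical
# phoneme (index 0) aligned to frames 0-1 and to frames 2-3.
toy_posteriors = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
toy_alignment = [(0, 0, 2), (0, 2, 4)]
toy_scores = gop_scores(toy_posteriors, toy_alignment)
```

In this toy input the second segment receives a lower score than the first, because the L1 model assigns the canonical phoneme a low posterior there. Under the assumption above, a lower GoP would indicate a stronger departure from L1 pronunciation, which is the kind of per-phoneme signal the abstract describes feeding into the TTS model.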

Haizhou Li | Guanglai Gao | Rui Liu | Haolin Zuo | De Hu
