Explicit Intensity Control for Accented Text-to-Speech

Accented text-to-speech (TTS) synthesis seeks to generate speech with a non-native (L2) accent as a variant of the standard native (L1) version. Controlling the intensity of the accent during TTS synthesis is an interesting research direction that has attracted growing attention. Recent work designs a speaker-adversarial loss to disentangle speaker and accent information, and then adjusts the loss weight to control the accent intensity. However, such a control method lacks interpretability: there is no direct correlation between the controlling factor and the natural accent intensity. To this end, this paper proposes a new, intuitive, and explicit accent intensity control scheme for accented TTS. Specifically, we first extract a posterior probability, called the ``goodness of pronunciation (GoP)'', from an L1 speech recognition model to quantify phoneme-level accent intensity in accented speech; we then design a FastSpeech2-based TTS model, named Ai-TTS, that takes the accent intensity expression into account during speech generation. Experiments show that our method outperforms the baseline model in terms of accent rendering and intensity control.
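The GoP measure mentioned above is conventionally computed (following Witt and Young's phone-level pronunciation scoring) by comparing the posterior of the canonical phoneme against the best-scoring competitor over the phoneme's aligned frames. The sketch below is a minimal illustration of that idea, not the paper's implementation; the array shapes, the function name, and the averaging scheme are assumptions for the example.

```python
import numpy as np

def goodness_of_pronunciation(log_posteriors, segments):
    """Per-phoneme GoP scores in the style of Witt & Young.

    log_posteriors: (T, P) array of frame-level log phoneme posteriors,
        e.g. from an L1-trained acoustic model (shape is an assumption).
    segments: list of (start_frame, end_frame, canonical_phoneme_id)
        triples from a forced alignment.

    Returns one score per segment: the average log-posterior of the
    canonical phoneme minus the average log-posterior of the best
    competing phoneme. Scores are <= 0; lower values indicate heavier
    mispronunciation, i.e. stronger accent intensity on that phoneme.
    """
    scores = []
    for start, end, phone in segments:
        frames = log_posteriors[start:end]       # (N, P) frames for this phone
        canonical = frames[:, phone].mean()      # avg log P(canonical | o)
        competitor = frames.max(axis=1).mean()   # avg log P(best phone | o)
        scores.append(canonical - competitor)
    return np.array(scores)
```

A GoP of 0 means the L1 recognizer's top hypothesis matches the canonical phoneme on every frame (no perceived accent), while increasingly negative values provide the explicit, interpretable intensity signal that the proposed scheme conditions the TTS model on.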
