Teaching American English pronunciation using a TTS service

In computer-assisted language learning (CALL) applications, students learn or improve a language using automated tools. CALL applications benefit from spoken examples by native speakers for teaching pronunciation, but realistically these are limited to the pre-defined curricula the application covers. In this work we allow the learner to practice pronunciation on freely input text, where the reference audio is generated by a text-to-speech (TTS) system. Instead of building a TTS system from scratch, we use a high-quality external service (Amazon Polly TTS). To use Amazon Polly successfully as a reference for teaching pronunciation, we carefully control the text normalization and expansion steps and use the viseme information returned by Polly to select the best phonetic transcription out of all the possible transcriptions computed from the text. We show the usefulness of the approach by comparing the pronunciation scores obtained by a native speaker reading a set of test sentences to the scores obtained from TTS audio on the same sentences. The results show that the TTS audio reaches a pronunciation score similar to that of real audio, and we therefore conclude that it can be used as a reference for pronunciation learning. We also discuss and address issues of transcription and audio mismatch.
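The viseme-based selection step can be illustrated with a short sketch. Amazon Polly returns viseme speech marks as newline-delimited JSON when `synthesize_speech` is called with `OutputFormat="json"` and `SpeechMarkTypes=["viseme"]`; a candidate phonetic transcription can then be checked for compatibility against the viseme sequence. The payload below is a sample (a real call requires AWS credentials via boto3), and the viseme-to-phoneme map is a small hypothetical subset for illustration, not the mapping used in the paper:

```python
import json

# Sample newline-delimited JSON in the shape Polly returns for
# SpeechMarkTypes=["viseme"] (illustrative values, not a real response).
sample_marks = """\
{"time": 6, "type": "viseme", "value": "t"}
{"time": 126, "type": "viseme", "value": "i"}
{"time": 320, "type": "viseme", "value": "sil"}"""

# Hypothetical viseme -> candidate-phoneme map (tiny illustrative subset;
# a real system covers the full viseme inventory, cf. phoneme-to-viseme
# mappings in the literature).
VISEME_TO_PHONEMES = {
    "t": {"t", "d"},
    "i": {"i:", "I"},
    "sil": {"sil"},
}

def parse_visemes(marks: str):
    """Parse speech-mark lines into (time_ms, viseme) tuples."""
    return [(m["time"], m["value"])
            for m in (json.loads(line) for line in marks.splitlines())
            if m["type"] == "viseme"]

def transcription_matches(phonemes, visemes):
    """True if a candidate phonetic transcription is consistent with the
    viseme sequence: same length, each phoneme compatible with its viseme."""
    return (len(phonemes) == len(visemes) and
            all(p in VISEME_TO_PHONEMES.get(v, set())
                for p, v in zip(phonemes, visemes)))

visemes = [v for _, v in parse_visemes(sample_marks)]
print(transcription_matches(["t", "i:", "sil"], visemes))  # True
print(transcription_matches(["k", "i:", "sil"], visemes))  # False
```

In practice, each transcription produced by the grapheme-to-phoneme component would be scored this way, and the one consistent with the visemes Polly actually spoke would be kept as the reference.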
