Visual-speech Synthesis of Exaggerated Corrective Feedback

To provide more discriminative feedback that helps second language (L2) learners identify their mispronunciations, we propose a method for exaggerated visual-speech feedback in computer-assisted pronunciation training (CAPT). The speech exaggeration is realized by an emphatic speech generation neural network based on Tacotron, while the visual exaggeration is accomplished by ADC Viseme Blending, namely increasing the Amplitude of movement, extending the phone's Duration, and enhancing the color Contrast. User studies show that the exaggerated feedback outperforms the non-exaggerated version in helping learners identify their mispronunciations and improve their pronunciation.
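
As a rough illustration of the ADC idea (not the authors' implementation), the sketch below applies the three exaggeration operations to a hypothetical sequence of viseme keyframes for one mispronounced phone; the keyframe representation, gain values, and helper names are all assumptions.

```python
import numpy as np

def adc_exaggerate(keyframes, rest_pose, amp_gain=1.5, dur_gain=1.8):
    """Hypothetical sketch of the A and D steps of ADC Viseme Blending.

    keyframes : (T, D) array of viseme/articulatory parameters over T frames
    rest_pose : (D,) neutral pose that movement amplitude is measured against
    """
    # A: increase the Amplitude of movement by scaling deviation from the rest pose
    amplified = rest_pose + amp_gain * (keyframes - rest_pose)

    # D: extend the phone's Duration by resampling the trajectory onto a longer time axis
    T, D = amplified.shape
    new_T = int(round(T * dur_gain))
    src_t = np.linspace(0.0, 1.0, T)
    dst_t = np.linspace(0.0, 1.0, new_T)
    stretched = np.stack(
        [np.interp(dst_t, src_t, amplified[:, d]) for d in range(D)], axis=1
    )
    return stretched

def enhance_contrast(frame_rgb, contrast_gain=1.3):
    # C: enhance the color Contrast of the rendered mouth region around mid-gray
    out = 0.5 + contrast_gain * (frame_rgb.astype(np.float32) / 255.0 - 0.5)
    return (np.clip(out, 0.0, 1.0) * 255.0).astype(np.uint8)
```

In this reading, amplitude and duration exaggeration operate on the viseme parameter trajectories before rendering, while contrast enhancement is a post-processing step on the rendered video frames.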
