Evaluating a 3-D virtual talking head on pronunciation learning

Abstract: We evaluate the effect of a 3-D virtual talking head on Mandarin pronunciation learning by non-native speakers under three presentation conditions: audio only (AU), human face video (HF), and audio-visual animation of a three-dimensional talking head (3-D). An auto language tutor (ALT) configured with the AU, HF and 3-D conditions was developed as the computer-aided pronunciation training system. We apply both subjective and objective methods to study user acceptance of the 3-D talking head, users' comparative impressions, and pronunciation performance under the different conditions. The subjective ratings show that the 3-D talking head achieved a high level of user acceptance, and that both 3-D and HF were preferred to AU. The objective pronunciation learning improvements show that 3-D was more beneficial than AU for the blade-alveolar, blade-palatal, lingua-palatal, open-mouth, open-mouth(-i) and round-mouth categories. Learning with 3-D was better than learning with HF for the blade-alveolar, lingua-palatal and round-mouth categories and for the falling-rising and falling tones. Learning with AU was better than learning with HF for the falling-rising tone. Neither HF nor AU was superior to 3-D for any of the initials, finals or tones.
