Developing an AI-Assisted Low-Resource Spoken Language Learning App for Children

Computer-assisted language learning (CALL) is a rapidly developing area, accelerated by advances in AI. A well-designed and reliable CALL system allows students to practice language skills, such as pronunciation, at any time outside the classroom. Furthermore, gamification via mobile applications has shown encouraging results on learning outcomes and motivates young users to practice more and to perceive language learning as a positive experience. In this work, we adapt the latest speech recognition technology to serve as part of an online pronunciation training system for young children. As part of our gamified mobile application, our models will assess the pronunciation quality of young Swedish children who have been diagnosed with Speech Sound Disorder and are participating in speech therapy. Additionally, the models provide feedback to young non-native children learning to pronounce Swedish and Finnish words. Our experiments revealed that these new models fit into an online game because they function as speech recognizers and pronunciation evaluators simultaneously. To make our systems more trustworthy and explainable, we investigated whether combining modern input attribution algorithms with time-aligned transcripts can explain the decisions made by the models, give us insight into how the models work, and provide a tool for developing more reliable solutions.
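As a rough illustration of how input attributions might be combined with a time-aligned transcript, the sketch below sums per-frame attribution magnitudes inside each aligned segment and reports each segment's share of the total. It is not the method used in this work: the function name `attribution_per_segment`, the `(label, start, end)` segment format, and the 20 ms frame shift are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

# Hypothetical segment format: (label, start time in seconds, end time in seconds)
Segment = Tuple[str, float, float]


def attribution_per_segment(
    frame_scores: Sequence[float],
    segments: Sequence[Segment],
    frame_shift: float = 0.02,  # assumed 20 ms between frames
) -> List[Tuple[str, float]]:
    """Return each segment's share of the total absolute attribution."""
    # Guard against an all-zero attribution vector.
    total = sum(abs(s) for s in frame_scores) or 1.0
    shares = []
    for label, start, end in segments:
        # Map segment boundaries from seconds to frame indices.
        first = max(int(round(start / frame_shift)), 0)
        last = min(int(round(end / frame_shift)), len(frame_scores))
        mass = sum(abs(s) for s in frame_scores[first:last])
        shares.append((label, mass / total))
    return shares
```

A segment with a large share is one the model relied on heavily for its decision, which could then be compared against the phoneme a human rater would consider mispronounced.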
