Highly accurate children's speech recognition for interactive reading tutors using subword units

Speech technology offers great promise in the field of automated literacy and reading tutors for children. In such applications, speech recognition can be used to track the reading position of the child, detect oral reading miscues, assess comprehension of the text being read by estimating whether the prosodic structure of the speech is appropriate to the discourse structure of the story, or engage the child in interactive dialogs to assess and train comprehension. Despite this promise, speech recognition systems exhibit higher error rates for children because of variability in vocal tract length, formant frequency, pronunciation, and grammar. When children read out loud, these problems are compounded by difficulties in recognizing printed words, which cause pauses, repeated syllables, and other disfluencies. To overcome these challenges, we present advances in speech recognition that improve accuracy and modeling capability in the context of an interactive literacy tutor for children. Specifically, this paper focuses on a novel set of speech recognition techniques that can be applied to improve oral reading recognition. First, we demonstrate that speech recognition error rates for interactive read-aloud can be reduced by more than 50% through a combination of advances in both statistical language modeling and acoustic modeling. Next, we propose extending our baseline system with a novel token-passing search architecture targeting subword-unit-based speech recognition. The proposed subword-unit-based framework is shown to provide accuracy equivalent to that of a whole-word-based speech recognizer while enabling detection of oral reading events and finer-grained speech analysis during recognition.
The efficacy of the approach is demonstrated using data collected from children in grades 3-5: 34.6% of partial words with reasonable evidence in the speech signal are detected, at a low false alarm rate of 0.5%.
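The two figures reported above, a detection rate and a false alarm rate, can be illustrated with a minimal sketch. The abstract does not define the denominators, so this assumes that the detection rate is taken over reference partial words and the false alarm rate over reference tokens that are not partial words. The class name, function names, and counts are hypothetical and chosen for illustration only; they are not the paper's data.

```python
from dataclasses import dataclass


@dataclass
class DetectionCounts:
    """Hypothetical tallies from aligning recognizer output with a reference."""
    true_partials: int      # reference partial words with evidence in the signal
    detected_partials: int  # of those, how many the recognizer flagged
    non_partials: int       # reference tokens that are not partial words
    false_alarms: int       # of those, how many were wrongly flagged as partial


def detection_rate(c: DetectionCounts) -> float:
    # Assumed definition: flagged partial words / all reference partial words.
    if c.true_partials <= 0:
        raise ValueError("true_partials must be positive")
    return c.detected_partials / c.true_partials


def false_alarm_rate(c: DetectionCounts) -> float:
    # Assumed definition: wrongly flagged tokens / all non-partial tokens.
    if c.non_partials <= 0:
        raise ValueError("non_partials must be positive")
    return c.false_alarms / c.non_partials


# Illustrative counts only (not taken from the paper).
counts = DetectionCounts(
    true_partials=50, detected_partials=20, non_partials=1000, false_alarms=5
)
print(f"detection rate:   {detection_rate(counts):.1%}")
print(f"false alarm rate: {false_alarm_rate(counts):.1%}")
```

Under this reading, a low false alarm rate matters more than it may appear, because non-partial tokens typically far outnumber partial words in read speech.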
