Exploiting automatic speech recognition errors to enhance partial and synchronized caption for facilitating second language listening

Abstract This paper examines the viability of using Automatic Speech Recognition (ASR) errors as a predictor of difficulty in speech segments, and exploits them to improve Partial and Synchronized Caption (PSC), which we previously proposed to train second language (L2) listening skills by encouraging listening over reading. The system uses ASR technology to perform word-level text-to-speech synchronization and to generate a partial caption. The baseline system determines difficult words based on three features: speech rate, word frequency, and specificity. Although this covers most difficult words, it does not capture the wider range of factors that hinder L2 listening. We therefore propose using ASR systems as a model of L2 listeners and hypothesize that ASR errors can predict speech segments that are challenging for these learners. Among the different cases of ASR errors, annotation results suggest that four categories are useful for L2 listeners: homophones, minimal pairs, negatives, and breached boundaries. A preliminary experiment with L2 learners focusing on these four categories of ASR errors revealed that such cases highlight problematic speech regions for L2 listeners. Based on these findings, the PSC system was enhanced to incorporate these useful ASR errors. An experiment with L2 learners demonstrated that the enhanced version of PSC is not only preferable but also more helpful in facilitating the L2 listening process.
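The selection rule described above can be sketched as follows. This is a minimal illustrative sketch, assuming hypothetical feature names, thresholds, and error labels (none of these values come from the paper): a word is shown in the partial caption when a baseline difficulty feature (speech rate, word frequency, specificity) fires, or, in the enhanced version, when the word falls into one of the four useful ASR-error categories.

```python
# Hypothetical sketch of PSC word selection. Thresholds and feature
# names are illustrative assumptions, not the paper's actual parameters.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Word:
    text: str
    speech_rate: float        # syllables/sec of the containing segment
    frequency_rank: int       # corpus frequency rank (1 = most frequent)
    is_specific: bool         # domain-specific / academic term
    asr_error: Optional[str]  # "homophone", "minimal_pair",
                              # "negative", "breached_boundary", or None

# Assumed baseline thresholds (illustrative only)
SPEECH_RATE_THRESHOLD = 5.0      # faster than this -> difficult
FREQUENCY_RANK_THRESHOLD = 3000  # rarer than this rank -> difficult

# The four ASR-error categories found useful for L2 listeners
USEFUL_ASR_ERRORS = {"homophone", "minimal_pair",
                     "negative", "breached_boundary"}

def show_in_caption(w: Word) -> bool:
    """Return True if the word should appear in the partial caption."""
    baseline = (w.speech_rate > SPEECH_RATE_THRESHOLD
                or w.frequency_rank > FREQUENCY_RANK_THRESHOLD
                or w.is_specific)
    # Enhancement: also caption words whose ASR errors fall into the
    # four categories useful for L2 listeners.
    enhanced = w.asr_error in USEFUL_ASR_ERRORS
    return baseline or enhanced
```

Under this sketch, a frequent function word like "their" would be hidden by the baseline rule but shown by the enhanced rule when it triggers a homophone-type ASR error.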
