Which Words Are Hard to Recognize? Prosodic, Lexical, and Disfluency Factors that Increase ASR Error Rates

Many factors are thought to increase the chances of misrecognizing a word in ASR, including low frequency, nearby disfluencies, short duration, and being at the start of a turn. However, few of these factors have been formally examined. This paper analyzes a variety of lexical, prosodic, and disfluency factors to determine which are likely to increase ASR error rates. Findings include the following. (1) For disfluencies, effects depend on the type of disfluency: errors increase by up to 15% (absolute) for words near fragments, but decrease by up to 7.2% (absolute) for words near repetitions. This decrease seems to be due to longer word duration. (2) For prosodic features, there are more errors for words with extreme values than words with typical values. (3) Although our results are based on output from a system with speaker adaptation, speaker differences are a major factor influencing error rates, and the effects of features such as frequency, pitch, and intensity may vary between speakers.

[1]  Eric Fosler-Lussier,et al.  Effects of speaking rate and word frequency on pronunciations in convertional speech , 1999, Speech Commun..

[2]  Andreas Stolcke,et al.  Recent innovations in speech-to-text transcription at SRI-ICSI-UW , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[3]  Sadaoki Furui,et al.  Error analysis using decision trees in spontaneous presentation speech recognition , 2001, IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01..

[4]  David B. Pisoni,et al.  Intelligibility of normal speech I: Global and fine-grained acoustic-phonetic talker characteristics , 1996, Speech Commun..

[5]  E. Shriberg,et al.  Acoustic properties of disfluent repetitions , 1995 .

[6]  Richard M. Stern,et al.  On the effects of speech rate in large vocabulary speech recognition systems , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[7]  Lori Lamel,et al.  Do speech recognizers prefer female speakers? , 2005, INTERSPEECH.

[8]  Julia Hirschberg,et al.  Prosodic and other cues to speech recognition failures , 2004, Speech Commun..

[9]  Eric Fosler-Lussier,et al.  Erratum to: "Effects of speaking rate and word frequency on pronunciations in convertional speech": [Speech Communication 29 (1999) 137-158] , 2000, Speech Commun..

[10]  R. P. Fahey,et al.  On explaining certain male-female differences in the phonetic realization of vowel categories , 1996 .

[11]  Paul Boersma,et al.  Praat, a system for doing phonetics by computer , 2002 .

[12]  Austin F. Frank,et al.  Analyzing linguistic data: a practical introduction to statistics using R , 2010 .

[13]  Adwait Ratnaparkhi,et al.  A Maximum Entropy Model for Part-Of-Speech Tagging , 1996, EMNLP.

[14]  Dan Jurafsky,et al.  Effects of disfluencies, predictability, and utterance position on word form variation in English conversation. , 2003, The Journal of the Acoustical Society of America.

[15]  Sadaoki Furui,et al.  Differences between acoustic characteristics of spontaneous and read speech and their effects on speech recognition performance , 2008, Comput. Speech Lang..