Leveraging automatic speech recognition errors to detect challenging speech segments in TED talks

This study investigates the use of Automatic Speech Recognition (ASR) systems to model second language (L2) listeners’ problems in perceiving TED talks. ASR-generated transcripts of videos often contain recognition errors, which may indicate segments that are difficult for L2 listeners. This paper aims to identify the root causes of these ASR errors and compare them with L2 listeners’ transcription mistakes. Our analysis of the ASR errors revealed several categories, such as minimal pairs, homophones, negative cases, and boundary misrecognition, which are assumed to reflect the challenging nature of the corresponding speech segments for L2 listeners. To confirm the usefulness of these categories, we asked L2 learners to watch and transcribe short segments of TED videos containing the above-mentioned categories of errors. The results revealed that learners’ transcription mistakes increased substantially when they transcribed segments of the audio in which the ASR system had made errors. This finding confirms the potential of ASR errors as predictors of L2 learners’ difficulties in listening to a particular audio. Furthermore, this study provided valuable data for enriching the Partial and Synchronized Caption (PSC) system we proposed earlier to facilitate and promote L2 listening skills.
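To make the error-detection step concrete, the sketch below aligns an ASR hypothesis against a reference transcript (TED talks publish official transcripts) and flags the spans where they disagree, which are the candidate "challenging" segments. This is a minimal illustration under stated assumptions: the alignment via difflib, the function name find_error_segments, and the example sentences are illustrative choices, not the paper's actual pipeline.

```python
import difflib

def find_error_segments(reference: str, hypothesis: str):
    """Align an ASR hypothesis against the reference transcript and
    return the word spans where they disagree. These spans are
    candidate difficult segments for L2 listeners. A minimal sketch;
    the study's own alignment and categorization method may differ."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words)
    errors = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # substitution ("replace"), insertion, or deletion
            errors.append({
                "type": op,
                "reference": " ".join(ref_words[i1:i2]),
                "asr_output": " ".join(hyp_words[j1:j2]),
            })
    return errors

# Hypothetical example: homophone-style confusions an ASR system
# might make ("their"/"there", "affects"/"effects").
ref = "their research affects the whole community"
hyp = "there research effects the whole community"
for e in find_error_segments(ref, hyp):
    print(e)
```

In the study itself, flagged spans of this kind would presumably then be classified into the error categories above (minimal pairs, homophones, negative cases, boundary misrecognition) before being used to select segments for the learner transcription task.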