Cross-lingual studies of ASR errors: paradigms for perceptual evaluations

It is well known that human listeners significantly outperform machines at transcribing speech. This paper presents a progress report on joint research comparing automatic and human speech transcription, and on perceptual experiments developed at LIMSI that aim to deepen our understanding of automatic speech recognition errors. Two paradigms are described in which human listeners are asked to transcribe speech segments containing words that the system frequently misrecognizes. In particular, we sought to measure how much additional context helps listeners disambiguate problematic lexical items, typically homophones or near-homophones. The long-term aim of this research is to improve the modeling of ambiguous contexts so as to reduce automatic transcription errors.