Phoneme confusions in human and automatic speech recognition

A comparison between automatic speech recognition (ASR) and human speech recognition (HSR) is performed as prerequisite for identifying sources of errors and improving feature extraction in ASR. HSR and ASR experiments are carried out with the same logatome database which consists of nonsense syllables. Two different kinds of signals are presented to human listeners: First, noisy speech samples are converted to Mel-frequency cepstral coefficients which are resynthesized to speech, with information about voicing and fundamental frequency being discarded. Second, the original signals with added noise are presented, which is used to evaluate the loss of information caused by the process of resynthesis. The analysis also covers the degradation of ASR caused by dialect or accent and shows that different error patterns emerge for ASR and HSR. The information loss induced by the calculation of ASR features has the same effect as a deteriation of the SNR by 10 dB. Index Terms: human speech recognition, automatic speech recognition, dialect, accent, phoneme confusions, MFCC

[1]  Tim Jürgens,et al.  Modelling the human-machine gap in speech reception: microscopic speech intelligibility prediction for normal-hearing subjects with an auditory model , 2007, INTERSPEECH.

[2]  Dirk Van Compernolle,et al.  Synthesizing speech from speech recognition parameters , 2004, INTERSPEECH.

[3]  Hermann Ney,et al.  Using phase spectrum information for improved speech recognition performance , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[4]  B. Kollmeier,et al.  A HUMAN-MACHINE COMPARISON IN SPEECH RECOGNITION BASED ON A LOGATOME CORPUS , 2006 .

[5]  Richard Lippmann,et al.  Speech recognition by machines and humans , 1997, Speech Commun..

[6]  Alfred Mertins,et al.  Oldenburg logatome speech corpus (OLLO) for speech recognition experiments with humans and machines , 2005, INTERSPEECH.

[7]  S.D. Peters,et al.  On the limits of speech recognition in noise , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[8]  Louis D. Braida,et al.  Human and machine consonant recognition , 2005, Speech Commun..