Effect of speech-intrinsic variations on human and automatic recognition of spoken phonemes.

The aim of this study is to quantify the gap between the recognition performance of human listeners and an automatic speech recognition (ASR) system, with a special focus on intrinsic variations of speech such as speaking rate and effort, altered pitch, and the presence of dialect and accent. Second, it is investigated whether the most common ASR features contain all the information required to recognize speech in noisy environments, by using resynthesized ASR features in listening experiments. For the phoneme recognition task, the ASR system reached the performance level of human listeners only when the signal-to-noise ratio (SNR) was increased by 15 dB, which serves as an estimate of the human-machine gap in terms of SNR. The major part of this gap is attributed to the feature-extraction stage, since human listeners achieve comparable recognition scores when the SNR difference between unaltered and resynthesized utterances is 10 dB. Intrinsic variations result in strong increases in error rates, both in human speech recognition (HSR) and in ASR (with relative increases of up to 120%). An analysis of phoneme durations and recognition rates indicates that, at low SNRs, human listeners exploit temporal cues better than the machine does, which suggests incorporating information about the temporal dynamics of speech into ASR systems.
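The listening experiments rest on resynthesizing speech from ASR features, so that human listeners hear only what the front end preserves. A minimal sketch of such a pipeline is given below, assuming the librosa library and its Griffin-Lim-based MFCC inversion as a stand-in for the study's actual resynthesis method; the synthetic test tone is likewise only a placeholder for a recorded utterance.

```python
import numpy as np
import librosa

# Stand-in signal: a synthetic one-second tone (the study used recorded
# logatome utterances instead).
sr = 16000
t = np.arange(sr) / sr
y = 0.5 * np.sin(2 * np.pi * 220 * t)

# Typical ASR front end: 13 mel-frequency cepstral coefficients per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Invert the features back to a waveform: librosa reconstructs the mel
# spectrogram from the cepstrum and recovers the phase with the
# Griffin-Lim algorithm. Anything the front end discards (fine spectral
# detail, phase) is therefore also absent from the resynthesized signal.
y_resynth = librosa.feature.inverse.mfcc_to_audio(mfcc, sr=sr)
```

Presenting such resynthesized utterances to listeners isolates the information loss of the feature-extraction stage from the rest of the recognizer, which is what allows the 15 dB human-machine gap to be decomposed as above.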
