Comparing human and automatic speech recognition in simple and complex acoustic scenes

Abstract

Previous comparisons of human speech recognition (HSR) and automatic speech recognition (ASR) have shown that humans outperform ASR systems on nearly all speech recognition tasks. However, recent progress in ASR has led to substantial improvements in recognition accuracy, so it is unclear how large the task-dependent human-machine gap remains. This paper investigates the gap between HSR and ASR based on deep neural networks (DNNs) in different acoustic conditions, with the aim of quantifying differences and identifying processing strategies that should be considered in ASR. We find that DNN-based ASR reaches human performance on single-channel, small-vocabulary tasks in speech-shaped noise and in multi-talker babble noise, which is an important difference from previous human-machine comparisons: the speech reception threshold (SRT), i.e., the signal-to-noise ratio at which 50% of words are recognized correctly, lies at about −7 to −8 dB for both HSR and ASR. However, in more complex spatial scenes with diffuse noise and moving talkers, the SRT gap amounts to approximately 12 dB. Based on cross comparisons that use oracle knowledge (e.g., the speakers' true position), incorrect responses are attributed to localization errors or to missing pitch information needed to distinguish between speakers of different gender. In terms of the SRT, localization errors and missing spectral information account for 2.1 and 3.2 dB, respectively. The comparison hence identifies specific components of ASR that could profit from strategies employed in auditory signal processing.
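To illustrate how the SRT defined above is obtained in practice, the following sketch fits a logistic psychometric function to word recognition rates measured at several SNRs and solves for the SNR yielding 50% correct. This is a minimal illustration, not code from the paper: the data points, function names, and the choice of a logistic form are assumptions for demonstration (matrix-test SRTs are typically measured with adaptive procedures rather than a post-hoc fit).

```python
# Minimal sketch (assumed, not from the paper): estimate an SRT by
# fitting a logistic psychometric function to word recognition rates
# measured at several SNRs. All data values below are invented.
import numpy as np
from scipy.optimize import curve_fit

def psychometric(snr, srt, slope):
    """Logistic recognition rate as a function of SNR (dB).

    By construction, psychometric(srt, srt, slope) == 0.5, so the
    fitted `srt` parameter is the 50%-correct point, i.e., the SRT.
    """
    return 1.0 / (1.0 + np.exp(-slope * (snr - srt)))

# Hypothetical measurements: SNR in dB and fraction of words recognized.
snrs = np.array([-14.0, -11.0, -8.0, -5.0, -2.0])
rates = np.array([0.05, 0.21, 0.52, 0.85, 0.97])

# Fit SRT and slope; start the search near the apparent midpoint.
(srt, slope), _ = curve_fit(psychometric, snrs, rates, p0=[-8.0, 1.0])
print(f"Estimated SRT: {srt:.1f} dB SNR (slope {slope:.2f} per dB)")
```

With data like the above, the fitted SRT lands near −8 dB, i.e., in the range the abstract reports for both HSR and ASR on the simple tasks; the human-machine gap in the complex spatial scenes corresponds to a rightward shift of the ASR curve by roughly 12 dB.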
