Identifying the human-machine differences in complex binaural scenes: what can be learned from our auditory system

Previous comparisons of human speech recognition (HSR) and automatic speech recognition (ASR) focused on monaural signals in additive noise and showed that HSR is far more robust against intrinsic and extrinsic sources of variation than conventional ASR. The aim of this study is to analyze the man-machine gap (and its causes) in more complex acoustic scenarios, particularly in scenes with two moving speakers, reverberation, and diffuse noise. Responses of nine normal-hearing listeners are compared to errors of an ASR system that employs a binaural model for direction-of-arrival estimation and beamforming for signal enhancement. The overall man-machine gap is measured in terms of the speech recognition threshold (SRT), i.e., the signal-to-noise ratio at which a 50% recognition rate is obtained. The comparison shows that the gap amounts to a 16.7 dB SRT difference, which exceeds the 10 dB difference found in monaural situations. Based on cross-comparisons that use oracle knowledge (e.g., the speakers' true positions), incorrect responses are attributed to localization errors (7 dB) or to missing spectral information for distinguishing between speakers of different gender (3 dB). The comparison hence identifies specific ASR components that can profit from insights into binaural auditory signal processing.
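As a worked illustration of the SRT readout: recognition rates are measured at several SNRs, a psychometric function is fitted to them, and the SRT is the SNR at which the fit crosses 50%. Below is a minimal sketch of this procedure; the logistic shape and the SNR/rate values are illustrative assumptions, not the study's actual data or fitting method.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(snr, srt, slope):
    """Psychometric function: recognition rate vs. SNR (dB).
    The srt parameter is the SNR at which the rate crosses 50%."""
    return 1.0 / (1.0 + np.exp(-slope * (snr - srt)))

# Hypothetical measurements: recognition rates at several SNRs (dB).
snrs = np.array([-21.0, -18.0, -15.0, -12.0, -9.0, -6.0])
rates = np.array([0.05, 0.15, 0.40, 0.70, 0.90, 0.97])

# Fit the psychometric function; the fitted srt is the SRT estimate.
(srt, slope), _ = curve_fit(logistic, snrs, rates, p0=(-12.0, 1.0))
print(f"Estimated SRT: {srt:.1f} dB SNR (slope {slope:.2f} per dB)")
```

Under this sketch, a man-machine gap of 16.7 dB simply means the ASR system's fitted SRT lies 16.7 dB above the listeners' SRT.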
