Listening in the Dips: Comparing Relevant Features for Speech Recognition in Humans and Machines

In recent years, automatic speech recognition (ASR) systems have gradually narrowed (and for some tasks closed) the performance gap to human listeners. However, it is unclear whether similar performance implies that humans and ASR systems rely on similar signal cues. In the current study, ASR and human speech recognition (HSR) are compared using speech material from a matrix sentence test mixed with either stationary speech-shaped noise (SSN) or amplitude-modulated SSN. Recognition performance of HSR and ASR is measured in terms of the speech recognition threshold (SRT), i.e., the signal-to-noise ratio at which a 50% recognition rate is reached, and by comparing psychometric functions. ASR results are obtained with matched-trained DNN-based systems that use FBank features as input and are compared to results from eight normal-hearing listeners and from two established models of speech intelligibility. For both maskers, HSR and ASR achieve similar SRTs, with an average deviation of only 0.4 dB. A relevance propagation algorithm is applied to identify the features relevant for ASR. The analysis shows that relevant features coincide either with spectral peaks of the speech signal or with dips of the noise masker, indicating that similar cues are important in HSR and ASR.
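The SRT is defined above as the point where a psychometric function (recognition rate vs. SNR) crosses 50%. A minimal sketch of how such a threshold can be read off measured data, assuming a logistic form for the psychometric function; the SNR grid and recognition rates here are illustrative, not the paper's data:

```python
import numpy as np
from scipy.optimize import curve_fit

def psychometric(snr, srt, slope):
    """Logistic psychometric function: recognition rate as a function of SNR (dB).
    Equals 0.5 exactly at snr == srt, so the fitted srt is the SRT."""
    return 1.0 / (1.0 + np.exp(-slope * (snr - srt)))

# Hypothetical measurements: fraction of words recognized at each tested SNR.
snrs = np.array([-12.0, -9.0, -6.0, -3.0, 0.0])   # dB SNR
rates = np.array([0.10, 0.28, 0.55, 0.81, 0.95])  # recognition rate

# Fit SRT (50% point) and slope; p0 is a rough initial guess.
(srt, slope), _ = curve_fit(psychometric, snrs, rates, p0=(-6.0, 0.5))
print(f"SRT = {srt:.1f} dB SNR, slope = {slope:.2f} per dB")
```

Comparing full psychometric functions, as the study does, then amounts to comparing both the fitted SRTs and the fitted slopes between HSR and ASR.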
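The DNN systems take FBank (log mel filterbank) features as input. A minimal sketch of such a front end, assuming librosa and typical parameters (16 kHz audio, 25 ms windows, 10 ms hop, 40 mel bands); the paper's exact FBank configuration is not specified here:

```python
import numpy as np
import librosa

# Load audio at 16 kHz; "utterance.wav" is a placeholder file name.
y, sr = librosa.load("utterance.wav", sr=16000)

# Mel filterbank energies: 400-sample (25 ms) windows, 160-sample (10 ms) hop.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)

# Log compression yields the FBank features, shape (40 bands, n_frames).
fbank = np.log(mel + 1e-10)
```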
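The relevance analysis attributes the DNN's decision back to individual time-frequency bins of the FBank input. One common variant of such relevance propagation is the epsilon rule of layer-wise relevance propagation, sketched here for a single dense layer; the function name and toy shapes are illustrative, not the paper's implementation:

```python
import numpy as np

def lrp_dense(a, w, b, relevance_out, eps=1e-6):
    """Epsilon-rule relevance propagation through one dense layer.

    a: input activations, shape (d_in,)
    w: weights, shape (d_in, d_out); b: bias, shape (d_out,)
    relevance_out: relevance assigned to the layer outputs, shape (d_out,)
    Returns the relevance redistributed to the inputs, shape (d_in,).
    """
    z = a @ w + b                              # pre-activations
    z = z + eps * np.where(z >= 0, 1.0, -1.0)  # stabilizer keeps divisions finite
    s = relevance_out / z                      # relevance per unit pre-activation
    return a * (w @ s)                         # input i receives sum_j a_i * w_ij * s_j
```

Applied layer by layer from the output back to the input, the resulting input relevances form a time-frequency map that can be compared against speech spectral peaks and masker dips, which is the comparison the abstract reports.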
