Sparse periodicity‐based auditory features explain human performance in a spatial multitalker auditory scene analysis task

Human listeners robustly decode speech information from a talker of interest that is embedded in a mixture of spatially distributed interferers. A relevant question is which time‐frequency segments of the speech are predominantly used by a listener to solve such a complex Auditory Scene Analysis task. A recent psychoacoustic study investigated the relevance of low signal‐to‐noise ratio (SNR) components of a target signal for speech intelligibility in a spatial multitalker situation. To this end, a three‐talker stimulus was manipulated in the spectro‐temporal domain such that target speech time‐frequency units below a variable SNR threshold (SNRcrit) were discarded while the interferers were kept unchanged. The psychoacoustic data indicate that only target components at and above a local SNR of about 0 dB contribute to intelligibility. This study applies an auditory scene analysis “glimpsing” model to the same manipulated stimuli. Model data are found to be similar to the human data, supporting the notion of “glimpsing,” that is, that salient speech‐related information is predominantly used by the auditory system to decode speech embedded in a mixture of sounds, at least for the tested conditions of three overlapping speech signals. This implies that perceptually relevant auditory information is sparse and may be processed with low computational effort, which is relevant for neurophysiological research on scene analysis and novelty processing in the auditory system.
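The stimulus manipulation described above can be illustrated with a minimal sketch. It is not the procedure used in the cited study. It assumes that the target and the summed interferers are available as separate time‐frequency arrays of equal shape (for example, short‐time spectral coefficients), and that the local SNR is the ratio of target power to interferer power in each unit. The function name, argument names, and the small constant `eps` are hypothetical.

```python
import math


def discard_low_snr_units(target_tf, interferer_tf, snr_crit_db):
    """Remove target units whose local SNR is below snr_crit_db.

    target_tf, interferer_tf: nested lists (frequency x time) of real or
    complex coefficients. The interferers are passed through unchanged.
    Returns the manipulated mixture and a boolean mask of retained units.
    """
    eps = 1e-12  # assumption: guards against log of zero in silent units
    mixture, keep = [], []
    for t_row, i_row in zip(target_tf, interferer_tf):
        mix_row, keep_row = [], []
        for t, i in zip(t_row, i_row):
            # Local SNR in dB from target and interferer power in this unit.
            snr_db = 10.0 * math.log10((abs(t) ** 2 + eps) / (abs(i) ** 2 + eps))
            retained = snr_db >= snr_crit_db
            # Discarded target units contribute nothing; the interferer stays.
            mix_row.append((t if retained else 0.0) + i)
            keep_row.append(retained)
        mixture.append(mix_row)
        keep.append(keep_row)
    return mixture, keep


if __name__ == "__main__":
    target = [[2.0, 0.1], [1.0, 0.5]]
    interferers = [[1.0, 1.0], [1.0, 2.0]]
    mixture, keep = discard_low_snr_units(target, interferers, snr_crit_db=0.0)
    print(keep)
    print(mixture)
```

With SNRcrit set to 0 dB, only units in which the target is at least as strong as the interferers survive, which corresponds to the threshold the psychoacoustic data point to. Lowering `snr_crit_db` retains progressively more of the target.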
