Modeling speech localization, talker identification, and word recognition in a multi-talker setting.

This study introduces a model for solving three different auditory tasks in a multi-talker setting: target localization, target identification, and word recognition. The model was used to simulate psychoacoustic data from a call-sign-based listening test involving multiple spatially separated talkers [Brungart and Simpson (2007). Percept. Psychophys. 69(1), 79-91]. The main characteristics of the model are (i) the extraction of salient auditory features ("glimpses") from the multi-talker signal and (ii) the use of a classification method that finds the best target hypothesis by comparing feature templates derived from clean target signals to the glimpses extracted from the multi-talker mixture. The four features used were periodicity, periodic energy, and periodicity-based interaural time and level differences. Model performance was well above chance for all subtasks and conditions and generally agreed closely with the subject data. This indicates that, despite their sparsity, glimpses provide sufficient information about a complex auditory scene. It also suggests that complex source-superposition models may not be needed for auditory scene analysis; instead, simple models of clean speech may be sufficient to decode even complex multi-talker scenes.
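
As a concrete illustration of the two model stages, the following minimal Python sketch shows glimpse-based template matching under stated assumptions: the function names (extract_glimpses, classify), the 0.7 periodicity threshold, the mean-squared-distance score, and the toy feature shapes are all illustrative choices, not the authors' implementation, and the random arrays stand in for the auditory periodicity, periodic-energy, ITD, and ILD feature maps used in the actual model.

```python
# Minimal sketch of glimpse-based classification (illustrative only; not
# the authors' implementation). A "glimpse" is modeled as a time-frequency
# feature unit whose periodicity strength exceeds a threshold; the best
# target hypothesis is the clean-speech template closest to the mixture
# features on the glimpsed units alone.
import numpy as np


def extract_glimpses(periodicity_strength, threshold=0.7):
    """Boolean mask marking salient ("glimpsed") feature units."""
    return periodicity_strength > threshold


def classify(mixture_features, glimpse_mask, templates):
    """Return the template label with the smallest mean squared distance
    to the mixture, evaluated only on glimpsed units."""
    best_label, best_dist = None, np.inf
    for label, template in templates.items():
        # Compare only where the mixture offers reliable (glimpsed) evidence.
        diff = (mixture_features - template)[glimpse_mask]
        dist = np.mean(diff ** 2) if diff.size else np.inf
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Toy usage: four feature maps (periodicity, periodic energy, ITD, ILD)
# on a 32-channel x 100-frame grid; the templates stand in for clean-speech
# feature maps of candidate target words.
rng = np.random.default_rng(0)
mixture = rng.normal(size=(32, 100, 4))
periodicity_strength = rng.uniform(size=(32, 100, 4))
mask = extract_glimpses(periodicity_strength)
templates = {f"word_{i}": rng.normal(size=(32, 100, 4)) for i in range(5)}
print(classify(mixture, mask, templates))
```

Restricting the comparison to glimpsed units is what lets clean-speech templates be used directly: no explicit model of how the competing talkers superimpose is required, which is the point made in the closing sentences of the abstract above.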

[1] Daniel Pressnitzer et al., "Rapid Formation of Robust Auditory Memories: Insights from Noise," Neuron, 2010.

[2] C. Faller et al., "Source localization in complex listening situations: selection of binaural cues based on interaural coherence," The Journal of the Acoustical Society of America, 2004.

[3] D. S. Brungart et al., "Informational and energetic masking effects in the perception of two simultaneous talkers," The Journal of the Acoustical Society of America, 2001.

[4] Volker Hohmann et al., "Combined Estimation of Spectral Envelopes and Sound Source Direction of Concurrent Voices by Multidimensional Statistical Filtering," IEEE Transactions on Audio, Speech, and Language Processing, 2007.

[5] Paris Smaragdis et al., "A non-negative approach to semi-supervised separation of speech from noise with the use of temporal dynamics," 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.

[6] Volker Hohmann et al., "Database of Multichannel In-Ear and Behind-the-Ear Head-Related and Binaural Room Impulse Responses," EURASIP Journal on Advances in Signal Processing, 2009.

[7] Brian C. J. Moore et al., "Properties of auditory stream formation," Philosophical Transactions of the Royal Society B: Biological Sciences, 2012.

[8] Volker Hohmann et al., "Modeling of speech localization in a multi-talker mixture using periodicity and energy-based auditory features," The Journal of the Acoustical Society of America, 2016.

[9] W. M. Rabinowitz et al., "Auditory localization of nearby sources. Head-related transfer functions," The Journal of the Acoustical Society of America, 1999.

[10] R. Gregory et al., "Knowledge in perception and illusion," Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences, 1997.

[11] Raymond J. Dolan et al., "Exploration, novelty, surprise, and free energy minimization," Frontiers in Psychology, 2013.

[12] Ruth Y. Litovsky et al., "The role of head-induced interaural time and level differences in the speech reception threshold for multiple interfering sound sources," The Journal of the Acoustical Society of America, 2004.

[13] W. T. Nelson et al., "A speech corpus for multitalker communications research," The Journal of the Acoustical Society of America, 2000.

[14] Esther Schoenmaker et al., "Intelligibility for Binaural Speech with Discarded Low-SNR Speech Components," Advances in Experimental Medicine and Biology, 2016.

[15] A. Pouget et al., "Probabilistic brains: knowns and unknowns," Nature Neuroscience, 2013.

[16] Peter S. Chang et al., "Exploration of Behavioral, Physiological, and Computational Approaches to Auditory Scene Analysis," 2004.

[17] J. Culling et al., "Perceptual and computational separation of simultaneous vowels: cues arising from low-frequency beating," The Journal of the Acoustical Society of America, 1994.

[18] Liang Li et al., "Human auditory cortex activity shows additive effects of spectral and spatial cues during speech segregation," Cerebral Cortex, 2011.

[19] Q. Summerfield et al., "Modeling the perception of concurrent vowels: vowels with the same fundamental frequency," The Journal of the Acoustical Society of America, 1989.

[20] W. A. Yost et al., "Discriminations of interaural phase differences," The Journal of the Acoustical Society of America, 1974.

[21] C. J. Darwin et al., "Listening to speech in the presence of other sounds," Philosophical Transactions of the Royal Society B: Biological Sciences, 2008.

[22] Ruth Y. Litovsky et al., "A cocktail party model of spatial release from masking by both noise and speech interferers," The Journal of the Acoustical Society of America, 2011.

[23] W. Yost et al., "Discrimination of interaural differences of level as a function of frequency," The Journal of the Acoustical Society of America, 1988.

[24] Brian Roberts et al., "Effects of differences in fundamental frequency on across-formant grouping in speech perception," The Journal of the Acoustical Society of America, 2009.

[25] R. W. Hukin et al., "Auditory objects of attention: the role of interaural time differences," Journal of Experimental Psychology: Human Perception and Performance, 1999.

[26] H. S. Colburn et al., "Speech intelligibility and localization in a multi-source environment," The Journal of the Acoustical Society of America, 1999.

[27] Stuart Gatehouse et al., "Perceptual segregation of competing speech sounds: the role of spatial location," The Journal of the Acoustical Society of America, 1999.

[28] Volker Hohmann et al., "Online Monaural Speech Enhancement Based on Periodicity Analysis and A Priori SNR Estimation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015.

[29] Jayaganesh Swaminathan et al., "On the Contribution of Target Audibility to Performance in Spatialized Speech Mixtures," Advances in Experimental Medicine and Biology, 2016.

[30] Martin Cooke et al., "A glimpsing model of speech perception in noise," The Journal of the Acoustical Society of America, 2006.

[31] DeLiang Wang et al., "A model for multitalker speech perception," The Journal of the Acoustical Society of America, 2008.

[32] Douglas S. Brungart et al., "Cocktail party listening in a dynamic multitalker environment," Perception & Psychophysics, 2007.

[33] DeLiang Wang et al., "Monaural speech segregation based on pitch tracking and amplitude modulation," IEEE Transactions on Neural Networks, 2002.

[34] C. Darwin et al., "Effects of fundamental frequency and vocal-tract length changes on attention to one of two simultaneous talkers," The Journal of the Acoustical Society of America, 2003.

[35] Volker Hohmann et al., "Auditory model based direction estimation of concurrent speakers from binaural signals," Speech Communication, 2011.

[36] Mark A. Ericson et al., "Factors That Influence Intelligibility in Multitalker Speech Displays," 2004.

[37] Daniel P. W. Ellis et al., "Using knowledge to organize sound: The prediction-driven approach to computational auditory scene analysis and its application to speech/nonspeech mixtures," Speech Communication, 1999.

[38] E. C. Cherry, "Some Experiments on the Recognition of Speech, with One and with Two Ears," 1953.