Modeling of speech localization in a multi-talker mixture using periodicity and energy-based auditory features.

A recent study showed that human listeners are able to localize a short speech target that is simultaneously masked by four speech tokens in reverberation [Kopčo, Best, and Carlile (2010). J. Acoust. Soc. Am. 127, 1450-1457]. Here, an auditory model for solving this task is introduced. The model has three processing stages: (1) extraction of the instantaneous interaural time difference (ITD) information, (2) selection of target-related ITD information ("glimpses") using a template-matching procedure based on periodicity, spectral energy, or both, and (3) target location estimation. The model performance was compared to the human data and to the performance of a modified model using an ideal binary mask (IBM) at stage (2). The IBM-based model performed similarly to the subjects, indicating that the binaural model is able to estimate source locations accurately. Template matching using spectral energy, and using a combination of spectral energy and periodicity, achieved good results, whereas using periodicity alone led to poor results. In particular, the glimpses extracted from the initial portion of the signal were critical for good performance. The simulation data show that the auditory features investigated here are sufficient to explain human performance in this challenging listening condition and may thus be used in models of auditory scene analysis.
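The Python sketch below illustrates only the overall shape of such a three-stage pipeline (ITD extraction, glimpse selection, location estimation). It is not the authors' model: it replaces the auditory filterbank with an STFT, replaces the periodicity/energy template matching (or IBM) with a simple energy threshold, and omits the ITD-to-azimuth mapping. All names and parameters (itd_per_band, select_glimpses, estimate_location, MAX_ITD, the energy quantile) are illustrative assumptions.

```python
# Minimal sketch of a three-stage binaural localization pipeline.
# Simplified stand-in, not the model described in the abstract.
import numpy as np
from scipy.signal import stft

FS = 44100          # sampling rate in Hz (assumed)
MAX_ITD = 1e-3      # plausible ITD range in seconds (assumed)

def itd_per_band(left, right, fs=FS, nperseg=1024):
    """Stage 1: instantaneous ITD per time-frequency unit from the
    interaural phase difference of an STFT (a crude stand-in for a
    gammatone filterbank with fine-structure analysis)."""
    f, _, L = stft(left, fs=fs, nperseg=nperseg)
    _, _, R = stft(right, fs=fs, nperseg=nperseg)
    ipd = np.angle(L * np.conj(R))                  # interaural phase difference
    with np.errstate(divide="ignore", invalid="ignore"):
        itd = ipd / (2 * np.pi * f[:, None])        # phase -> time difference
    itd[0, :] = 0.0                                 # DC band carries no ITD
    energy = np.abs(L) ** 2 + np.abs(R) ** 2
    return itd, energy

def select_glimpses(itd, energy, energy_quantile=0.8):
    """Stage 2 (simplified): keep high-energy units with plausible ITDs,
    an energy-based proxy for the template-matching / IBM step."""
    mask = (energy >= np.quantile(energy, energy_quantile)) \
           & (np.abs(itd) <= MAX_ITD)
    return itd[mask]

def estimate_location(glimpse_itds, n_bins=41):
    """Stage 3: pool the selected ITDs into a histogram and return the
    ITD at its peak (mapping ITD to azimuth is omitted)."""
    hist, edges = np.histogram(glimpse_itds, bins=n_bins,
                               range=(-MAX_ITD, MAX_ITD))
    peak = np.argmax(hist)
    return 0.5 * (edges[peak] + edges[peak + 1])

# Usage (left_sig, right_sig are the two ear signals as 1-D arrays):
#   itd, energy = itd_per_band(left_sig, right_sig)
#   target_itd = estimate_location(select_glimpses(itd, energy))
```

In this simplified form the "glimpse" selection is purely energy-based; the paper's point is precisely that the selection criterion (periodicity, spectral energy, their combination, or an ideal binary mask) determines how closely the model matches human performance.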

[1] Volker Hohmann, et al., Online Monaural Speech Enhancement Based on Periodicity Analysis and A Priori SNR Estimation, 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[2] DeLiang Wang, et al., Speech segregation based on sound localization, 2001, IJCNN'01: International Joint Conference on Neural Networks, Proceedings (Cat. No.01CH37222).

[3] S. Shamma, et al., Segregation of complex acoustic scenes based on temporal coherence, 2013, eLife.

[4] S. Carlile, et al., Speech localization in a multitalker mixture, 2010, The Journal of the Acoustical Society of America.

[5] C. Faller, et al., Source localization in complex listening situations: selection of binaural cues based on interaural coherence, 2004, The Journal of the Acoustical Society of America.

[6] Mathias Dietz, et al., Emphasis of spatial cues in the temporal fine structure during the rising segments of amplitude-modulated sounds, 2013, Proceedings of the National Academy of Sciences.

[7] K. Sen, et al., A computational model of spatial tuning in the auditory cortex in response to competing sound sources, 2013.

[8] S. Shamma, et al., Temporal coherence and attention in auditory scene analysis, 2011, Trends in Neurosciences.

[9] S. M. Abel, et al., Sound localization: effects of reverberation time, speaker array, stimulus frequency, and stimulus rise/decay, 1993, The Journal of the Acoustical Society of America.

[10] Jon Barker, et al., Modelling speaker intelligibility in noise, 2007, Speech Commun.

[11] E. Langendijk, et al., Sound localization in the presence of one or two distracters, 2001, The Journal of the Acoustical Society of America.

[12] Barbara G. Shinn-Cunningham, et al., Effect of stimulus spectrum on distance perception for nearby sources, 2011, The Journal of the Acoustical Society of America.

[13] T. Dau, et al., A quantitative model of the "effective" signal processing in the auditory system. I. Model structure, 1996, The Journal of the Acoustical Society of America.

[14] H. Gockel, On possible cues in profile analysis: identification of the incremented component, 1998, The Journal of the Acoustical Society of America.

[15] DeLiang Wang, et al., On Ideal Binary Mask As the Computational Goal of Auditory Scene Analysis, 2005, Speech Separation by Humans and Machines.

[16] Tammo Houtgast, et al., Stimulus-onset dominance in the perception of binaural information, 1994, Hearing Research.

[17] H. Colonius, et al., Auditory profile analysis: is there perceptual constancy for spectral shape for stimuli roved in frequency?, 1997, The Journal of the Acoustical Society of America.

[18] S. Shamma, et al., Temporal Coherence in the Perceptual Organization and Cortical Representation of Auditory Scenes, 2009, Neuron.

[19] Yu He, et al., Hearing Two Things at Once: Neurophysiological Indices of Speech Segregation and Identification, 2005, Journal of Cognitive Neuroscience.

[20] C. Darwin, et al., Perceptual grouping of speech components differing in fundamental frequency and onset-time, The Quarterly Journal of Experimental Psychology Section A: Human Experimental Psychology.

[21] Volker Hohmann, et al., Auditory model based direction estimation of concurrent speakers from binaural signals, 2011, Speech Commun.

[22] Shihab Shamma, et al., Adaptive auditory computations, 2014, Current Opinion in Neurobiology.

[23] Virginia Best, et al., Listening to every other word: examining the strength of linkage variables in forming streams of speech, 2008, The Journal of the Acoustical Society of America.

[24] R. L. Freyman, et al., Onset dominance in lateralization, 1997, The Journal of the Acoustical Society of America.

[25] Barbara G. Shinn-Cunningham, et al., Localizing nearby sound sources in a classroom: binaural room impulse responses, 2005, The Journal of the Acoustical Society of America.