Speech Localisation in a Multitalker Mixture by Humans and Machines

Speech localisation in multitalker mixtures is affected by the listener's expectations about the spatial arrangement of the sound sources. This effect was investigated via experiments with human listeners and a machine system, in which the task was to localise a female-voice target among four spatially distributed male-voice maskers. Two configurations were used: the masker locations were either fixed across trials or varied from trial to trial. The machine system used deep neural networks (DNNs) to learn the mapping between binaural cues and source azimuth, and exploited top-down knowledge of the spectral characteristics of the target source. Performance was examined in both anechoic and reverberant conditions. Our experiments show that the machine system outperformed the human listeners in some conditions. Both the machine system and the listeners were able to make use of a priori knowledge about the spatial configuration of the sources, but the effect for headphone listening was smaller than that previously reported for listening in a real room.
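The abstract describes the machine system only at a high level. As a rough illustration of the general approach, not the authors' implementation, the sketch below shows a minimal feed-forward DNN that maps a per-frame binaural cue vector (here, one interaural time difference and one interaural level difference per frequency band) to a discrete azimuth class. The feature layout, network size, azimuth grid, and training data are all hypothetical placeholders.

```python
# Hypothetical sketch: DNN mapping binaural cues to source azimuth.
# Feature layout, layer sizes, and the azimuth grid are illustrative
# assumptions only; they are NOT taken from the paper.
import torch
import torch.nn as nn

N_BANDS = 32                        # assumed number of auditory frequency bands
N_FEATURES = 2 * N_BANDS            # one ITD and one ILD estimate per band
AZIMUTHS = torch.arange(-90, 91, 5) # assumed 5-degree azimuth grid
N_CLASSES = len(AZIMUTHS)

# Small fully connected classifier: cue vector -> azimuth posterior.
model = nn.Sequential(
    nn.Linear(N_FEATURES, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, N_CLASSES),
)

# Synthetic stand-in for (cue vector, azimuth class) training pairs.
x = torch.randn(256, N_FEATURES)
y = torch.randint(0, N_CLASSES, (256,))

optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    optimiser.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimiser.step()

# At test time, per-frame posteriors would typically be pooled over time;
# frames dominated by the (known) target spectrum could be weighted more
# heavily, echoing the paper's use of top-down source knowledge.
with torch.no_grad():
    posterior = model(x[:1]).softmax(dim=-1)
    estimate = AZIMUTHS[posterior.argmax()]
    print(f"estimated azimuth: {int(estimate)} degrees")
```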
