A speech fragment approach to localising multiple speakers in reverberant environments

Sound source localisation cues are severely degraded when multiple acoustic sources are active in the presence of reverberation. We present a binaural system for localising simultaneous speakers which exploits the fact that in a speech mixture there exist spectro-temporal regions, or 'fragments', where the energy is dominated by just one of the speakers. A fragment-level localisation model is proposed that integrates the localisation cues within a fragment using a weighted mean. The weights are based on local estimates of the degree of reverberation in a given spectro-temporal cell. The paper investigates different weight estimation approaches based variously on: (i) an established model of the perceptual precedence effect; (ii) a measure of interaural coherence between the left and right ear signals; (iii) a data-driven approach trained in matched acoustic conditions. Experiments with reverberant binaural data containing two simultaneous speakers show that appropriate weighting can improve frame-based localisation performance by up to 24%.
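The fragment-level integration described above can be sketched as a reliability-weighted mean of per-cell localisation estimates. The sketch below is illustrative only: the function name and the exact form of the weights (e.g. derived from interaural coherence or a precedence-effect model, as the abstract suggests) are assumptions, not the paper's implementation.

```python
import numpy as np

def fragment_azimuth(cell_azimuths, cell_weights):
    """Combine per-cell azimuth estimates (degrees) within one fragment
    using a weighted mean, where each weight reflects how reliable that
    spectro-temporal cell's localisation cue is (e.g. low weight for
    cells judged to be dominated by reverberant energy)."""
    az = np.asarray(cell_azimuths, dtype=float)
    w = np.asarray(cell_weights, dtype=float)
    total = w.sum()
    if total <= 0.0:
        # No cell judged reliable: fall back to an unweighted mean.
        return float(az.mean())
    return float(np.dot(w, az) / total)
```

For example, a fragment whose cells estimate azimuths of 10, 20 and 30 degrees, with the last cell weighted twice as heavily, yields a fragment-level estimate of 22.5 degrees rather than the unweighted 20.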
