Speaker turn tracking with mobile microphones: Combining location and pitch information

This paper considers the problem of using binaural microphones to track speakers when the microphones are themselves in motion (i.e., due to listener head movement). We present a general framework that applies particle filtering to combine sequential interaural time difference (ITD) cues with noisy sensor motion data. The framework is demonstrated in a meeting scenario, on a moving-listener version of a speaker diarization task. The paper extends previous work by investigating two potentially complementary ways of exploiting pitch track estimates in this framework: (i) informing the time points at which speaker turn changes may occur, and (ii) improving the ITD estimates by allowing integration over spectro-temporal regions grouped by pitch. Experiments using real meeting recordings, made with in-ear binaural microphones, show that the latter approach leads to large and significant reductions in diarization error rate.
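To make the general idea concrete, the following is a minimal sketch of a bootstrap particle filter that fuses ITD observations with noisy head-orientation readings to track a single speaker's azimuth. It is not the paper's implementation: the abstract does not specify the state model, the likelihood, or how pitch is used, so the sine-law ITD model, the random-walk dynamics, the Gaussian likelihood, the ear spacing, and all names and parameter values (`itd_model`, `particle_filter_step`, `track`, `EAR_DISTANCE`, the noise levels) are assumptions chosen for illustration. Pitch-based grouping and turn-change modelling are not covered.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s
EAR_DISTANCE = 0.18     # m; assumed inter-ear spacing, not taken from the paper


def itd_model(relative_azimuth):
    """Simplified sine-law ITD in seconds for a source at the given
    azimuth (radians) relative to the head. An illustrative assumption."""
    return (EAR_DISTANCE / SPEED_OF_SOUND) * np.sin(relative_azimuth)


def particle_filter_step(particles, itd_obs, head_angle, rng,
                         process_std=0.02, head_std=0.05, itd_std=5e-5):
    """One predict/weight/resample step.

    particles  : world-frame speaker azimuth hypotheses (radians)
    itd_obs    : observed ITD for this frame (seconds)
    head_angle : noisy head-orientation reading (radians)
    """
    n = particles.size
    # Predict: assumed random-walk dynamics for the speaker azimuth.
    particles = particles + rng.normal(0.0, process_std, n)
    # Account for sensor noise by perturbing the head reading per particle.
    head = head_angle + rng.normal(0.0, head_std, n)
    # Weight: Gaussian likelihood of the observed ITD (an assumption).
    predicted = itd_model(particles - head)
    log_w = -0.5 * ((itd_obs - predicted) / itd_std) ** 2
    weights = np.exp(log_w - log_w.max())
    weights /= weights.sum()
    # Systematic resampling.
    positions = (rng.random() + np.arange(n)) / n
    idx = np.minimum(np.searchsorted(np.cumsum(weights), positions), n - 1)
    return particles[idx]


def track(itd_observations, head_angles, n_particles=500, seed=0):
    """Return the posterior-mean world-frame azimuth for each frame."""
    rng = np.random.default_rng(seed)
    # Start with hypotheses spread over the frontal hemisphere.
    particles = rng.uniform(-np.pi / 2, np.pi / 2, n_particles)
    estimates = []
    for itd_obs, head_angle in zip(itd_observations, head_angles):
        particles = particle_filter_step(particles, itd_obs, head_angle, rng)
        estimates.append(particles.mean())
    return np.array(estimates)
```

Calling `track(itd_observations, head_angles)` with one ITD value and one head-angle reading per frame returns one azimuth estimate per frame. Keeping the state in world coordinates and subtracting the head angle inside the likelihood is one simple way to stop listener motion from being mistaken for speaker motion; other formulations are possible.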
