Short-Term Spatio–Temporal Clustering Applied to Multiple Moving Speakers

Distant microphones make it possible to process spontaneous multiparty speech with very few constraints on speakers, as opposed to close-talking microphones. Minimizing the constraints on speakers enables a wide range of applications, including meeting summarization and browsing, surveillance, hearing aids, and more natural human-machine interaction. Such applications of distant microphones require determining where and when the speakers are talking. This is inherently a multisource problem, because of background noise sources as well as the natural tendency of multiple speakers to talk over each other. Moreover, spontaneous speech utterances are highly discontinuous, which makes it difficult to track multiple speakers with classical filtering approaches such as Kalman filtering or particle filters. As an alternative, this paper proposes a probabilistic framework to determine the trajectories of multiple moving speakers in the short term only, i.e., only while they speak. Instantaneous location estimates that are close in space and time are grouped into "short-term clusters" in a principled manner. Each short-term cluster determines the precise start and end times of an utterance and a short-term spatial trajectory. Contrastive experiments on real indoor recordings, with seated speakers in meetings as well as multiple moving speakers, clearly show the benefit of short-term clustering.
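
To make the grouping idea concrete, the following is a minimal sketch only. It is not the probabilistic framework proposed in the paper: it swaps in a plain deterministic rule that links two instantaneous estimates whenever they fall within fixed time and angle thresholds, then takes connected groups as clusters. The representation of an estimate as a (time, azimuth) pair, the function names, and the threshold values are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: (time in seconds, azimuth in degrees).
Estimate = Tuple[float, float]


def angular_diff(a: float, b: float) -> float:
    """Absolute difference between two azimuths, with wrap-around at 360."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def short_term_clusters(
    estimates: List[Estimate],
    max_dt: float = 0.2,      # assumed time threshold, seconds
    max_dtheta: float = 5.0,  # assumed angle threshold, degrees
) -> List[List[Estimate]]:
    """Group estimates that are close in both time and azimuth.

    Two estimates are linked if they differ by at most max_dt and
    max_dtheta; clusters are the connected groups (union-find).
    """
    order = sorted(range(len(estimates)), key=lambda i: estimates[i][0])
    parent = list(range(len(estimates)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for pos, a in enumerate(order):
        t_a, th_a = estimates[a]
        for b in order[pos + 1:]:
            t_b, th_b = estimates[b]
            if t_b - t_a > max_dt:
                break  # later estimates are even further away in time
            if angular_diff(th_a, th_b) <= max_dtheta:
                parent[find(a)] = find(b)

    clusters: Dict[int, List[Estimate]] = {}
    for i in order:  # time order, so each cluster is a short trajectory
        clusters.setdefault(find(i), []).append(estimates[i])
    return sorted(clusters.values(), key=lambda c: c[0][0])


def utterance_bounds(cluster: List[Estimate]) -> Tuple[float, float]:
    """Start and end time of the utterance covered by one cluster."""
    return cluster[0][0], cluster[-1][0]
```

In this simplified form, each returned cluster is a time-ordered list of estimates, so it carries both the utterance boundaries (via `utterance_bounds`) and a short spatial trajectory. Two speakers talking at the same time but from different directions end up in separate clusters, and the same speaker resuming after a long silence starts a new cluster.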
