Unsupervised Location-Based Segmentation of Multi-Party Speech

Accurate detection and segmentation of spontaneous multi-party speech is crucial for a variety of applications, including speech acquisition and recognition, as well as higher-level event recognition. The task is difficult for two reasons: spontaneous speech is highly sporadic, and multi-party speech contains many overlaps between speakers. We propose to treat the problem as a tracking task that uses location cues only. To cope with the sporadicity, we propose a novel, generic, short-term clustering algorithm that can track multiple objects at low computational cost. The proposed approach is online, fully deterministic, and can run in real time. Applied to real meeting data, the algorithm produces high-precision speech segmentation.
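To make the idea of online, deterministic, location-only segmentation concrete, here is a minimal sketch in Python. It is not the algorithm proposed in the paper, whose details the abstract does not give; it is a simple greedy short-term clustering written for illustration. The names `Track`, `segment`, `max_gap`, and `max_angle`, and the assumption that each frame supplies zero or more azimuth estimates in degrees, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Track:
    """One short-term cluster of location estimates (hypothetical structure)."""
    start: int           # frame index of the first detection
    end: int             # frame index of the most recent detection
    last_azimuth: float  # most recent azimuth assigned to this track, in degrees


def angular_distance(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def segment(frames: Iterable[Tuple[int, List[float]]],
            max_gap: int = 5,
            max_angle: float = 10.0) -> List[Track]:
    """Greedy online clustering of per-frame azimuth detections.

    frames: (frame_index, [azimuth_deg, ...]) pairs in increasing time order.
    max_gap and max_angle are illustrative thresholds, not values from the paper.
    """
    active: List[Track] = []
    finished: List[Track] = []
    for t, detections in frames:
        # Close tracks that have been silent for more than max_gap frames.
        still_active: List[Track] = []
        for tr in active:
            (still_active if t - tr.end <= max_gap else finished).append(tr)
        active = still_active

        for az in detections:
            # Attach the detection to the nearest active track in angle,
            # or start a new track if none is close enough.
            best = min(active,
                       key=lambda tr: angular_distance(tr.last_azimuth, az),
                       default=None)
            if best is not None and angular_distance(best.last_azimuth, az) <= max_angle:
                best.end = t
                best.last_azimuth = az
            else:
                active.append(Track(start=t, end=t, last_azimuth=az))

    finished.extend(active)
    return sorted(finished, key=lambda tr: (tr.start, tr.last_azimuth))


if __name__ == "__main__":
    frames = [(0, [30.0]), (1, [31.0, 120.0]), (2, [119.0]), (3, []), (20, [32.0])]
    for tr in segment(frames):
        print(tr.start, tr.end, tr.last_azimuth)
```

In this toy input, two sources are active at frame 1, so the sketch yields two tracks that overlap in time, and the later detection near 30 degrees starts a new track because the earlier one has been silent for longer than `max_gap`. Because every decision depends only on past frames and fixed thresholds, the procedure is online and deterministic, which are the properties the abstract emphasizes.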
