Further Applications of Sector-Based Detection and Short-Term Clustering

This paper presents an effective implementation of detection and localization of multiple speech sources with microphone arrays. In particular, Scaled Conjugate Gradient descent is used for fast and precise localization within a pre-detected volume of space. The approach is suitable for real-time implementation. An unsupervised approach to speech/non-speech discrimination is also proposed. The integrated system is then successfully applied to the segmentation of spontaneous multi-party speech, as found in meetings. Based on this system, the unsupervised speaker clustering task is then investigated, using distant microphones only. This task is challenging because of the poor signal quality and the fast-changing speaker turns encountered in spontaneous speech. An extension of the Bayesian Information Criterion (BIC) to multiple modalities is proposed, which makes it possible to combine the strengths of speaker location information -- useful in the short term -- and acoustic speaker information, i.e. MFCCs -- useful in the longer term. The combined approach yields a dramatic improvement in speaker clustering results compared with the acoustic-only approach, and the results are close to those obtained with close-talking microphones. Finally, an initial investigation of automatic audio-visual calibration is presented.
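As a rough illustration of the kind of criterion involved in BIC-based speaker clustering, the sketch below computes the standard single-Gaussian delta-BIC merge test for two clusters and then combines two modalities by simply summing their scores. This is not the multi-modal extension proposed in the paper, whose exact form is not given in the abstract. The summation rule, the function names (`delta_bic`, `combined_delta_bic`), the penalty weight `lam`, and the covariance regularization constant are all assumptions made for illustration.

```python
import numpy as np


def logdet_cov(x, eps=1e-6):
    """Log-determinant of the maximum-likelihood covariance of the rows of x."""
    cov = np.atleast_2d(np.cov(x, rowvar=False, bias=True))
    # Small diagonal loading (an assumption) keeps the matrix well conditioned.
    cov = cov + eps * np.eye(cov.shape[0])
    _, logdet = np.linalg.slogdet(cov)
    return logdet


def delta_bic(x1, x2, lam=1.0):
    """Delta-BIC between modelling x1 and x2 with two Gaussians versus one.

    x1, x2: arrays of shape (n_frames, n_dims).
    A positive value favours keeping the clusters separate;
    a negative value favours merging them.
    """
    n1, d = x1.shape
    n2 = x2.shape[0]
    n = n1 + n2
    # Log-likelihood gain of two full-covariance Gaussians over a single one.
    gain = 0.5 * (n * logdet_cov(np.vstack([x1, x2]))
                  - n1 * logdet_cov(x1)
                  - n2 * logdet_cov(x2))
    # Penalty for the extra mean and covariance parameters.
    penalty = 0.5 * lam * (d + d * (d + 1) / 2) * np.log(n)
    return gain - penalty


def combined_delta_bic(loc1, loc2, mfcc1, mfcc2, lam=1.0):
    """Hypothetical combination: sum of per-modality delta-BIC scores.

    Treats location and acoustic features as independent, which is an
    illustrative assumption and not the formulation used in the paper.
    """
    return delta_bic(loc1, loc2, lam) + delta_bic(mfcc1, mfcc2, lam)
```

In this sketch, two segments would be merged when the combined score is negative. Summing the scores amounts to assuming that location and acoustic features are independent given the cluster, which is the simplest possible combination rule.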
