Spatio-Temporal Analysis of Spontaneous Speech with Microphone Arrays

Accurate detection, localization and tracking of multiple moving speakers permits a wide spectrum of applications. Techniques are required that are versatile, robust to environmental variations, and not constraining for non-technical end-users. Based on distant recording of spontaneous multiparty conversations, this thesis focuses on the use of microphone arrays to address the question "Who spoke where and when?" The speed, versatility and robustness of the proposed techniques are tested on a variety of real indoor recordings, including multiple moving speakers as well as seated speakers in meetings. Optimized implementations are provided in most cases. We propose to discretize the physical space into a few sectors and, for each time frame, to determine which sectors contain active acoustic sources (Where? When?). A topological interpretation of beamforming is proposed, which permits both evaluating the average acoustic energy in a sector at negligible cost and precisely locating a speaker within an active sector. One additional contribution that goes beyond the field of microphone arrays is a generic, automatic threshold selection method, which does not require any training data. On the speaker detection task, the new approach is dramatically superior to the more classical approach where a threshold is set on training data. The new approach is used within an integrated system for multispeaker detection-localization. Another generic contribution is a principled, threshold-free framework for short-term clustering of multispeaker location estimates, which also permits detecting where and when multiple trajectories intersect. On multi-party meeting recordings, using distant microphones only, short-term clustering yields a speaker segmentation performance similar to that of close-talking microphones.
The resulting short speech segments are then grouped into speaker clusters (Who?), through an extension of the Bayesian Information Criterion to merge multiple modalities. On meeting recordings, the speaker clustering performance is significantly improved by merging the classical mel-cepstrum information with the short-term speaker location information. Finally, a close analysis of the speaker clustering results suggests that future research should investigate the effect of human acoustic radiation characteristics on the overall transmission channel, when a speaker is a few meters away from a microphone.
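
As an illustration of the kind of decision involved in BIC-based speaker clustering, the following is a minimal sketch of the classical single-modality merge test, in which each cluster is modelled by one full-covariance Gaussian. It is not the multimodal extension proposed in the thesis, which is not specified here. The function names, the `penalty_weight` parameter and the sign convention (positive means "keep the clusters separate") are assumptions made for this example.

```python
import numpy as np


def _logdet_cov(x):
    """Log-determinant of the maximum-likelihood covariance of the rows of x."""
    cov = np.atleast_2d(np.cov(x, rowvar=False, bias=True))
    _, logdet = np.linalg.slogdet(cov)
    return logdet


def delta_bic(x1, x2, penalty_weight=1.0):
    """BIC difference between two-Gaussian and one-Gaussian models.

    x1, x2: arrays of shape (n_frames, n_dims), e.g. cepstral feature vectors.
    Returns a positive value when two separate Gaussians are preferred
    (different speakers) and a negative value when a single Gaussian
    is preferred (merge the clusters).
    """
    n1, d = x1.shape
    n2 = x2.shape[0]
    n = n1 + n2
    pooled = np.vstack([x1, x2])
    # Log-likelihood gain obtained by modelling the two clusters separately.
    gain = 0.5 * (n * _logdet_cov(pooled)
                  - n1 * _logdet_cov(x1)
                  - n2 * _logdet_cov(x2))
    # Extra free parameters of the second Gaussian: mean plus covariance.
    n_params = d + d * (d + 1) / 2
    penalty = 0.5 * penalty_weight * n_params * np.log(n)
    return gain - penalty


def should_merge(x1, x2, penalty_weight=1.0):
    """Merge two clusters when the single-Gaussian model wins under BIC."""
    return delta_bic(x1, x2, penalty_weight) < 0.0
```

In an agglomerative clustering loop, one would typically evaluate `delta_bic` for every pair of clusters, merge the pair with the lowest value, and stop once no pair yields a negative value.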
