Speech/Non-Speech Detection in Meetings from Automatically Extracted Low-Resolution Visual Features

In this paper we address the problem of estimating who is speaking in group meetings from automatically extracted low-resolution visual cues. Traditionally, speech/non-speech detection, or speaker diarization, determines who speaks and when from audio features alone. Recent work has addressed the problem audio-visually, but often with less emphasis on the visual component. Because the audio stream can easily be lost during video conferences, this work proposes methods for estimating speaking status using only low-resolution visual cues. We carry out experiments to examine how context, obtained by observing group behaviour and task-oriented activities, can improve estimates of speaking status. We test on 105 minutes of natural meeting data with unconstrained conversations.
