Speaker Detection and Applications to Cross-Modal Analysis of Planning Meetings

Detecting meeting events is one of the most important tasks in the multimodal analysis of planning meetings, and speaker detection is a key step in extracting most meaningful meeting events. In this paper, we present an approach to speaker localization that combines visual and audio information. When people talk, their speech is accompanied by mouth movements and hand gestures. By computing the correlation among audio signals, mouth movements, and hand motion, we detect the talking person both spatially and temporally. Three kinds of features are extracted for speaker localization: hand movements are represented by hand motion efforts; audio is represented by 12 mel-frequency cepstral coefficients computed from the audio signal; and mouth movements are represented by normalized cross-correlation coefficients of the mouth area between two successive frames. A time-delay neural network is trained to learn the correlations among these features and is then applied to perform speaker localization. Experiments and applications in planning meeting environments are presented.
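As an illustration of the mouth-movement feature, the following is a minimal sketch of a normalized cross-correlation coefficient computed between the mouth regions of successive frames. It is not the authors' implementation. The zero-mean form of the coefficient, the value returned for a flat (constant) patch, and the names `normalized_cross_correlation` and `mouth_motion_feature` are all assumptions made for this example; the mouth regions are assumed to be already cropped to equally sized grayscale patches.

```python
import math
from typing import List, Sequence

# A grayscale patch as rows of pixel intensities (hypothetical representation).
Patch = Sequence[Sequence[float]]


def normalized_cross_correlation(prev: Patch, curr: Patch) -> float:
    """Zero-mean normalized cross-correlation of two equally sized patches.

    Returns a value in [-1, 1]; values near 1 mean the mouth region barely
    changed between the two frames.
    """
    a = [float(v) for row in prev for v in row]
    b = [float(v) for row in curr for v in row]
    if not a or len(a) != len(b):
        raise ValueError("patches must be non-empty and of equal size")

    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]

    numerator = sum(x * y for x, y in zip(da, db))
    denominator = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denominator == 0.0:
        # Assumption: a constant patch carries no texture to correlate,
        # so the coefficient is reported as 0.0 instead of dividing by zero.
        return 0.0
    return numerator / denominator


def mouth_motion_feature(mouth_patches: Sequence[Patch]) -> List[float]:
    """One coefficient per pair of successive frames."""
    return [
        normalized_cross_correlation(p, c)
        for p, c in zip(mouth_patches, mouth_patches[1:])
    ]
```

For a sequence of N mouth patches this yields N - 1 coefficients. A sustained drop in the coefficient suggests mouth motion, which could then be paired with the audio and hand-motion features as input to a classifier.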
