Just-in-time multimodal association and fusion from home entertainment

In this paper, we describe a real-time multimodal analysis system with just-in-time multimodal association and fusion for a living-room environment, where multiple people may enter, interact and leave the observable scene without constraints. The system comprises detection and tracking of up to four faces, detection and localisation of verbal and paralinguistic events, and their association and fusion. It is designed for open, unconstrained environments such as next-generation video-conferencing systems that automatically “orchestrate” the transmitted video streams to improve the overall experience of interaction between spatially separated families and friends. Performance levels achieved to date on a hand-labelled dataset show sufficient reliability while fulfilling real-time processing requirements.
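The core just-in-time association step, attributing a localised audio event to one of the currently tracked faces, could be sketched as follows. This is a minimal illustration, not the paper's actual method: it assumes both modalities report a horizontal bearing (azimuth), and all names (`FaceTrack`, `associate_event`, the 15-degree gate) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FaceTrack:
    """A tracked face with its horizontal bearing from the sensor rig (hypothetical)."""
    track_id: int
    azimuth_deg: float

def associate_event(event_azimuth_deg: float,
                    faces: List[FaceTrack],
                    max_gap_deg: float = 15.0) -> Optional[int]:
    """Associate a localised audio event with the nearest tracked face.

    Returns the id of the face closest in azimuth, provided the angular
    gap is within `max_gap_deg`; returns None when the event cannot be
    attributed to any visible face (e.g. an off-screen speaker).
    The threshold value is an illustrative assumption.
    """
    best_id: Optional[int] = None
    best_gap = max_gap_deg
    for face in faces:
        gap = abs(event_azimuth_deg - face.azimuth_deg)
        if gap <= best_gap:
            best_id, best_gap = face.track_id, gap
    return best_id
```

Returning `None` for unmatched events matters in an open-world setting: people may speak while outside the camera's field of view, so the fusion stage must tolerate audio events with no visual counterpart.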