Real-Time Audio-Visual Analysis for Multiperson Videoconferencing

We describe the design of a system that combines several state-of-the-art real-time audio and video processing components to enable multimodal stream manipulation (e.g., automatic online editing for multiparty videoconferencing applications) in open, unconstrained environments. The underlying algorithms are designed to let multiple people enter, interact in, and leave the observable scene without constraints. They comprise continuous localisation of audio objects and its application to spatial audio object coding; detection and tracking of faces; estimation of head poses and visual focus of attention; detection and localisation of verbal and paralinguistic events; and the association and fusion of these different events. Combined, these components represent multimodal streams with audio objects and semantic video objects, and provide semantic information to stream manipulation systems such as a virtual director. Experiments evaluating the system demonstrate the effectiveness of the proposed design and of the individual algorithms, as well as the benefit of fusing different modalities in this scenario.
