Paper information - Unsupervised extraction of audio-visual objects

We propose a novel method to automatically detect and extract the video modality of the sound sources that are present in a scene. For this purpose, we first assess the synchrony between the moving objects captured with a video camera and the sounds recorded by a microphone. Next, video regions presenting a high coherence with the soundtrack are automatically labelled as being part of the source. This represents the starting point for an innovative video segmentation approach, whose objective is to extract the complete audiovisual object. The proposed graph-cut segmentation procedure includes an audio-visual term that links together pixels in regions with high audio-video coherence. Our approach is demonstrated on challenging sequences presenting non-stationary sound sources and distracting moving objects.
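The first two steps of the abstract (measuring audio-video synchrony, then labelling highly coherent regions as part of the source) can be illustrated with a minimal sketch. The abstract does not say which synchrony measure is used, so the code below assumes a simple stand-in: the per-pixel Pearson correlation between frame-difference magnitude and a per-frame audio energy. The function names `synchrony_map` and `source_seeds` and the threshold value are hypothetical, and the graph-cut segmentation stage is not shown.

```python
import numpy as np


def synchrony_map(frames, audio_energy):
    """Per-pixel Pearson correlation between motion and audio energy.

    frames: array of shape (T, H, W), grayscale video.
    audio_energy: array of shape (T,), one energy value per video frame.
    Returns an (H, W) array with values in [-1, 1].
    Assumption: correlation is only a stand-in synchrony measure.
    """
    frames = np.asarray(frames, dtype=float)
    audio_energy = np.asarray(audio_energy, dtype=float)
    # Motion proxy: absolute difference between consecutive frames.
    motion = np.abs(np.diff(frames, axis=0))      # shape (T-1, H, W)
    audio = audio_energy[1:]                      # aligned with the differences
    # Centre both signals along the time axis.
    motion_c = motion - motion.mean(axis=0)
    audio_c = audio - audio.mean()
    # Correlation numerator and denominator for every pixel at once.
    num = np.tensordot(audio_c, motion_c, axes=(0, 0))
    den = np.sqrt((motion_c ** 2).sum(axis=0) * (audio_c ** 2).sum())
    # Pixels with no motion variance get a synchrony of 0.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)


def source_seeds(sync, threshold=0.5):
    """Label pixels whose synchrony exceeds a (hypothetical) threshold."""
    return sync > threshold


# Synthetic example: one patch moves with the audio, another moves independently.
rng = np.random.default_rng(0)
T, H, W = 200, 8, 8
audio = rng.random(T)
video = np.zeros((T, H, W))
video[:, 2:4, 2:4] = np.cumsum(audio)[:, None, None]          # sounding object
video[:, 5:7, 5:7] = np.cumsum(rng.random(T))[:, None, None]  # distractor
seeds = source_seeds(synchrony_map(video, audio))
```

In this toy setting the seed mask covers only the patch whose motion follows the audio; in the approach the abstract describes, such high-coherence regions would then serve as the starting point for the graph-cut segmentation.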

Pierre Vandergheynst | Anna Llagostera Casanovas
