Audiovisual Gestalts

This paper presents an algorithm that correlates audio and visual data generated by the same physical phenomenon. Psychophysical experiments indicate that temporal synchrony strongly contributes to the integration of cross-modal information in humans; accordingly, we define meaningful audiovisual structures as temporally proximal audio and video events. Audio and video signals are represented as sparse decompositions over redundant dictionaries of functions, which makes it possible to define perceptually meaningful audiovisual events. These cross-modal structures are detected with a simple rule known as the Helmholtz principle. Experimental results show that, by extracting significant synchronous audiovisual events, we can detect the cross-modal correlation between the two signals even in the presence of distracting motion and acoustic noise. These results confirm that temporal proximity between audiovisual events is a key ingredient for integrating information across modalities, and that it can be effectively exploited in the design of multi-modal analysis algorithms.
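As a concrete illustration of the sparse-decomposition step, here is a minimal matching-pursuit sketch in Python. It is a generic implementation of the Mallat-Zhang greedy loop over an arbitrary unit-norm dictionary, not the paper's own code: the paper uses dedicated audio and geometry-adapted video dictionaries, and the random dictionary below is purely an illustrative assumption.

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy sparse decomposition over a redundant dictionary
    (Mallat & Zhang style). `dictionary` holds unit-norm atoms as rows."""
    residual = signal.astype(float).copy()
    atoms = []  # selected (atom index, coefficient) pairs
    for _ in range(n_atoms):
        # Pick the atom most correlated with the current residual.
        corr = dictionary @ residual
        i = int(np.argmax(np.abs(corr)))
        atoms.append((i, corr[i]))
        # Subtract the projection; the residual energy never increases.
        residual -= corr[i] * dictionary[i]
    return atoms, residual

# Illustrative usage with a random 4x-redundant dictionary.
rng = np.random.default_rng(1)
D = rng.standard_normal((256, 64))
D /= np.linalg.norm(D, axis=1, keepdims=True)
x = 2.0 * D[10] - 1.5 * D[99]          # signal built from two atoms
atoms, res = matching_pursuit(x, D, n_atoms=5)
print(atoms[:2], np.linalg.norm(res))
```

Because the dictionary is redundant rather than orthogonal, the greedy loop may need a few extra iterations to drive the residual down, which is the usual trade-off for the added expressiveness of overcomplete representations.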

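The Helmholtz principle declares a group of observations meaningful when its expected number of occurrences under a naive independence model, the number of false alarms (NFA), falls below 1. The sketch below applies this idea to audio-video onset synchrony; the binary-onset representation, the binomial null model, and all parameter values are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np
from scipy.stats import binom

def nfa_synchrony(audio_onsets, video_onsets, duration, window, n_tests=1):
    """A-contrario test: how surprising is the observed number of
    temporally proximal audio-video onsets if the streams are independent?

    audio_onsets, video_onsets : 1-D arrays of event times (frames)
    duration : total sequence length (frames)
    window   : max temporal distance for two onsets to count as synchronous
    n_tests  : number of hypotheses tested (e.g. candidate video regions)
    """
    audio_onsets = np.asarray(audio_onsets)
    # k = number of video onsets with an audio onset within +/- window.
    k = int(sum(np.any(np.abs(audio_onsets - v) <= window)
                for v in video_onsets))
    n = len(video_onsets)
    # Under independence, a video onset lands near some audio onset with
    # probability p = fraction of the timeline covered by audio windows.
    p = min(1.0, len(audio_onsets) * (2 * window + 1) / duration)
    # NFA = (number of tests) * P[Binomial(n, p) >= k]; the Helmholtz
    # principle declares the synchrony meaningful when NFA < 1.
    return n_tests * binom.sf(k - 1, n, p)

# Example: 12 of 15 visual onsets coincide with audio onsets.
rng = np.random.default_rng(0)
audio = np.sort(rng.integers(0, 3000, size=40))
video = np.concatenate([audio[:12] + 1, [100, 900, 2500]])
print(nfa_synchrony(audio, video, duration=3000, window=5, n_tests=50))
```

In this toy run the NFA is far below 1, so the synchrony would be declared meaningful even after accounting for the 50 hypotheses tested; a few coincidental matches from distracting motion or acoustic noise would not change that verdict, which is the robustness property the experiments highlight.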