Audio-Video Integration for Background Modelling

This paper introduces a new concept in surveillance: audio-visual data integration for background modelling. Visual data acquired by a fixed camera can be naturally complemented by audio information, allowing a more complete analysis of the monitored scene. The key idea is to build a multimodal model of the scene background that can promptly detect audio-only or video-only events, as well as simultaneous audio and visual foreground situations. This also makes it possible to tackle some open problems of standard visual surveillance systems (e.g., the sleeping foreground problem), provided the foreground is also characterized by an audio signature. The method is based on the probabilistic modelling of the audio and video data streams using separate sets of adaptive Gaussian mixture models, and on their integration using a coupled audio-video adaptive model working on the frame histogram and the audio frequency spectrum. This framework has been shown to be able to evaluate the temporal causality between visual and audio foreground entities. To the best of our knowledge, this is the first attempt at on-line multimodal scene modelling using a single static camera and a single microphone. Preliminary results show the effectiveness of the approach on problems still unsolved by purely visual monitoring approaches.
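To make the per-modality modelling concrete, below is a minimal sketch of a Stauffer-Grimson-style adaptive Gaussian mixture model of the kind the abstract refers to. The class name, parameter values, and the simplified update rule (a constant learning rate in place of the full posterior-weighted one) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an adaptive Gaussian mixture model for one scalar
# feature stream. All names and parameter values are illustrative
# assumptions, not the authors' code.
import numpy as np

class AdaptiveGMM:
    """Adaptive mixture of K Gaussians over a single feature stream."""

    def __init__(self, k=3, alpha=0.01, var_init=36.0, bg_threshold=0.7):
        self.k = k                        # number of mixture components
        self.alpha = alpha                # learning rate
        self.means = np.zeros(k)
        self.vars = np.full(k, var_init)
        self.weights = np.full(k, 1.0 / k)
        self.bg_threshold = bg_threshold  # cumulative weight covering background

    def update(self, x):
        """Update the mixture with observation x; return True if x is background."""
        d = np.abs(x - self.means)
        matched = d < 2.5 * np.sqrt(self.vars)   # within 2.5 standard deviations
        if matched.any():
            m = int(np.argmax(matched))          # first matching component
            rho = self.alpha                     # simplified; the full rule scales by N(x | mean, var)
            self.means[m] += rho * (x - self.means[m])
            self.vars[m] += rho * ((x - self.means[m]) ** 2 - self.vars[m])
            ownership = np.zeros(self.k)
            ownership[m] = 1.0
            self.weights += self.alpha * (ownership - self.weights)
        else:
            m = int(np.argmin(self.weights))     # replace the least probable component
            self.means[m], self.vars[m] = x, 36.0
            self.weights[m] = self.alpha
        self.weights /= self.weights.sum()

        # Components with high weight and low variance form the background model.
        order = np.argsort(-self.weights / np.sqrt(self.vars))
        cum = np.cumsum(self.weights[order])
        n_bg = int(np.searchsorted(cum, self.bg_threshold)) + 1
        return bool(matched.any() and m in order[:n_bg])
```

In use, one such mixture would run per pixel (or per histogram bin) for the video stream and per frequency bin for the audio spectrum, each yielding a per-sample background/foreground decision that feeds the coupled model.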

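The coupled decision stage can be pictured as follows. This is a hypothetical sketch: the two feature extractors mirror the frame histogram and audio frequency spectrum mentioned in the abstract, while classify_event and its window parameter are invented stand-ins for the paper's coupled adaptive model and time-causality evaluation.

```python
# Hypothetical sketch of the coupled audio-video foreground decision.
# The fusion rule and the causality window below are assumptions for
# illustration, not the authors' exact formulation.
import numpy as np

def frame_histogram(frame, bins=64):
    """Normalized grey-level histogram of a video frame (H x W uint8 array)."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / max(hist.sum(), 1)

def audio_spectrum(samples, n_fft=512):
    """Magnitude spectrum of one audio buffer (1-D float array)."""
    return np.abs(np.fft.rfft(samples, n=n_fft))

def classify_event(video_fg, audio_fg, history, window=10):
    """Label the current instant from the per-modality foreground flags.

    `history` is a list of past (video_fg, audio_fg) pairs, used to check
    whether an audio onset follows recent visual activity within `window`
    steps (a crude stand-in for the paper's time-causality evaluation).
    """
    history.append((video_fg, audio_fg))
    if video_fg and audio_fg:
        return "audio-visual event"
    if audio_fg:
        recent_video = any(v for v, _ in history[-window:])
        # e.g. a 'sleeping' visual foreground that is still making noise
        return "audio event (follows visual activity)" if recent_video else "audio-only event"
    if video_fg:
        return "visual-only event"
    return "background"
```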