Paper information - CASSANDRA: audio-video sensor fusion for aggression detection

CASSANDRA: audio-video sensor fusion for aggression detection

This paper presents a smart surveillance system named CASSANDRA, aimed at detecting instances of aggressive human behavior in public environments. A distinguishing aspect of CASSANDRA is that it exploits the complementary nature of audio and video sensing to disambiguate scene activity in real-life, noisy, and dynamic environments. At the lower level, independent analysis of the audio and video streams yields intermediate descriptors of the scene, such as "scream", "passing train", or "articulation energy". At the higher level, a Dynamic Bayesian Network serves as the fusion mechanism and produces an aggregate aggression indication for the current scene. Our prototype system is validated on a set of scenarios performed by professional actors at an actual train station, which ensures a realistic audio and video noise setting.
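The two-level design described in the abstract can be illustrated with a minimal sketch. It is not the network used in the paper: it assumes a single binary hidden state (calm or aggressive), two binary descriptors (a "scream" audio cue and a high "articulation energy" video cue), and conditional independence of the two cues given the state. Every probability, along with the names `TRANSITION`, `P_SCREAM`, `P_HIGH_MOTION`, and `fuse`, is a hypothetical placeholder chosen for illustration.

```python
import numpy as np

# All probabilities are illustrative assumptions, not values from the paper.
# Hidden state index: 0 = calm, 1 = aggressive.
TRANSITION = np.array([[0.95, 0.05],   # P(next state | current = calm)
                       [0.10, 0.90]])  # P(next state | current = aggressive)

# P(descriptor fires | state), indexed by hidden state.
P_SCREAM = np.array([0.05, 0.60])       # audio cue: "scream" detected
P_HIGH_MOTION = np.array([0.10, 0.70])  # video cue: high articulation energy


def likelihood(p_fire, observed):
    """Return P(observation | state) for one binary descriptor."""
    return p_fire if observed else 1.0 - p_fire


def fuse(observations, prior=(0.99, 0.01)):
    """Forward filtering over (scream, high_motion) pairs.

    Returns P(aggressive | observations so far) for each time step.
    """
    belief = np.asarray(prior, dtype=float)
    aggression = []
    for scream, high_motion in observations:
        # Time update: propagate the belief through the transition model.
        predicted = belief @ TRANSITION
        # Measurement update: audio and video are assumed conditionally
        # independent given the state, so their likelihoods multiply.
        evidence = likelihood(P_SCREAM, scream) * likelihood(P_HIGH_MOTION, high_motion)
        belief = predicted * evidence
        belief /= belief.sum()
        aggression.append(float(belief[1]))
    return aggression


if __name__ == "__main__":
    frames = [(False, False), (True, False), (True, True), (True, True)]
    for step, p in enumerate(fuse(frames)):
        print(f"step {step}: P(aggressive) = {p:.3f}")
```

With these placeholder numbers, a scream alone raises the aggression estimate only moderately, while a scream that coincides with high motion raises it sharply. That is the kind of disambiguation the abstract attributes to combining the two modalities.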
