Joint Audio-Visual Words for Violent Scenes Detection in Movies

This paper presents an audio-visual data representation for violent scenes detection in movies. Existing works in this field consider either the audio or the visual information, or at best their shallow fusion; none has yet explored their joint dependence for violent scenes detection. We propose a feature that provides strong multi-modal audio and visual cues by first joining the audio and visual features and then statistically revealing the joint multi-modal patterns. Experimental validation was conducted in the context of the Violent Scenes Detection task of the MediaEval 2013 Multimedia benchmark. The results show the potential of the proposed approach in comparison with methods using audio or visual features separately and with other fusion methods.
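To make the idea concrete, a joint audio-visual representation of this kind can be built by concatenating the audio and visual descriptors of each temporal segment, clustering the concatenated descriptors into a codebook of joint audio-visual words, and describing each shot as a histogram over that codebook before classification. The sketch below is only an illustration of that general recipe, not the paper's exact pipeline: the helper names (build_joint_codebook, encode_shot), the k-means codebook, the SVM classifier, and all parameter values are assumptions.

    # Minimal sketch of a joint audio-visual bag-of-words pipeline.
    # All function names and parameters are hypothetical illustrations.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import normalize
    from sklearn.svm import SVC

    def build_joint_codebook(audio_feats, visual_feats, n_words=256, seed=0):
        """Concatenate per-segment audio and visual descriptors, then cluster
        them so that each centroid becomes a joint audio-visual word."""
        joint = np.hstack([normalize(audio_feats), normalize(visual_feats)])
        return KMeans(n_clusters=n_words, random_state=seed).fit(joint)

    def encode_shot(codebook, audio_feats, visual_feats, n_words=256):
        """Assign every segment of a shot to its nearest joint word and return
        the L1-normalized histogram (bag of joint audio-visual words)."""
        joint = np.hstack([normalize(audio_feats), normalize(visual_feats)])
        words = codebook.predict(joint)
        hist = np.bincount(words, minlength=n_words).astype(float)
        return hist / max(hist.sum(), 1.0)

    # Hypothetical usage: shots is a list of (audio, visual) descriptor arrays,
    # labels marks violent (1) vs. non-violent (0) shots.
    # codebook = build_joint_codebook(all_audio_segments, all_visual_segments)
    # X = np.array([encode_shot(codebook, a, v) for a, v in shots])
    # clf = SVC(kernel="rbf", probability=True).fit(X, labels)

In such a sketch, the key design choice is that clustering happens after concatenation, so each codeword captures co-occurring audio and visual patterns rather than patterns from either modality alone.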
