Detecting violent content in Hollywood movies by mid-level audio representations

Detecting violent content in movies, for example to provide automated youth protection services, is a valuable video content analysis capability. Choosing discriminative features to represent video segments is a key issue in designing violence detection algorithms. In this paper, we employ mid-level audio features based on a Bag-of-Audio Words (BoAW) method that uses Mel-Frequency Cepstral Coefficients (MFCCs). We construct the BoAW representations with two different methods: a vector quantization-based (VQ-based) method and a sparse coding-based (SC-based) method. We use two-class support vector machines (SVMs) to classify video shots as violent or non-violent. Our experiments on detecting violent video shots in Hollywood movies show that the mid-level audio features provide promising results. We also find that the SC-based method outperforms the VQ-based one. More importantly, in terms of average precision, the SC-based method outperforms all unimodal submissions to the MediaEval Violent Scenes Detection (VSD) task except one vision-based method.
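The difference between the two BoAW constructions can be illustrated with a minimal sketch. It is not the paper's implementation and rests on several assumptions: the "MFCC frames" are random 13-dimensional vectors, the codebook and dictionary are random rather than learned, the sparse codes are computed with ISTA (a simple lasso solver chosen for brevity, not necessarily the solver used in the paper), and max pooling of absolute coefficients is one common pooling choice. All function names (`vq_boaw`, `sparse_codes`, `sc_boaw`) are hypothetical.

```python
import numpy as np


def vq_boaw(frames, codebook):
    """VQ-based BoAW: hard-assign each frame to its nearest audio word."""
    # Squared Euclidean distance between every frame and every codeword.
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    hist = np.bincount(nearest, minlength=codebook.shape[0]).astype(float)
    return hist / hist.sum()  # L1-normalised histogram of audio words


def sparse_codes(frames, dictionary, lam=0.1, n_iter=200):
    """Approximate lasso codes with ISTA (assumed stand-in solver).

    Minimises 0.5 * ||x - a D||^2 + lam * ||a||_1 for each frame x.
    """
    # Step size 1/L, where L is the squared spectral norm of the dictionary.
    lipschitz = np.linalg.norm(dictionary, 2) ** 2
    codes = np.zeros((frames.shape[0], dictionary.shape[0]))
    for _ in range(n_iter):
        grad = (codes @ dictionary - frames) @ dictionary.T
        z = codes - grad / lipschitz
        # Soft thresholding keeps only the strongest activations.
        codes = np.sign(z) * np.maximum(np.abs(z) - lam / lipschitz, 0.0)
    return codes


def sc_boaw(frames, dictionary, lam=0.1):
    """SC-based BoAW: sparse-code each frame, then max-pool over the shot."""
    codes = sparse_codes(frames, dictionary, lam)
    return np.abs(codes).max(axis=0)  # assumed pooling: max of |coefficients|


# Synthetic stand-ins for one shot's MFCC frames and a learned dictionary.
rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 13))       # 50 frames, 13 coefficients each
dictionary = rng.normal(size=(8, 13))    # 8 audio words
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

vq_feature = vq_boaw(frames, dictionary)
sc_feature = sc_boaw(frames, dictionary)
```

Under these assumptions, each shot yields one fixed-length vector per method (`vq_feature` or `sc_feature`), which is the kind of input a two-class SVM would then classify as violent or non-violent. The VQ variant lets each frame vote for exactly one audio word, whereas the SC variant lets a frame activate several words with different weights.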
