Paper information - Detecting violent content in Hollywood movies by mid-level audio representations

Detecting violent content in Hollywood movies by mid-level audio representations

Detecting violent content in movies, e.g., to provide automated youth protection services, is a valuable video content analysis capability. Choosing discriminative features for representing video segments is a key issue in designing violence detection algorithms. In this paper, we employ mid-level audio features based on a Bag-of-Audio Words (BoAW) method using Mel-Frequency Cepstral Coefficients (MFCCs). BoAW representations are constructed with two different methods: the vector quantization-based (VQ-based) method and the sparse coding-based (SC-based) method. We use two-class support vector machines (SVMs) to classify video shots as violent or non-violent. Our experiments on detecting violent video shots in Hollywood movies show that the mid-level audio features provide promising results. Additionally, we establish that the SC-based method outperforms the VQ-based one. More importantly, in terms of average precision, the SC-based method outperforms all unimodal submissions to the MediaEval Violent Scenes Detection (VSD) task except one vision-based method.
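The two BoAW constructions named in the abstract can be illustrated with a minimal sketch. It assumes the MFCC frames of a shot are already extracted and that a codebook (for VQ) or a dictionary (for SC) has already been learned. The function names `vq_boaw`, `sparse_codes`, and `sc_boaw` are hypothetical, and the ISTA solver, the `lam` penalty, and the max pooling step are assumptions made for illustration; the abstract does not state which solver or pooling the authors used.

```python
import numpy as np


def vq_boaw(frames, codebook):
    """VQ-based BoAW: hard-assign each frame to its nearest codeword
    and return the normalized histogram of assignments."""
    # Squared Euclidean distances, shape (n_frames, n_words).
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    hist = np.bincount(nearest, minlength=codebook.shape[0]).astype(float)
    return hist / hist.sum()


def sparse_codes(frames, dictionary, lam=0.1, n_iter=100):
    """Solve min_A 0.5 * ||frames - A @ dictionary||^2 + lam * ||A||_1
    with ISTA (an assumed solver, chosen here for brevity)."""
    # Lipschitz constant of the gradient: squared largest singular value.
    lipschitz = np.linalg.norm(dictionary, 2) ** 2
    codes = np.zeros((frames.shape[0], dictionary.shape[0]))
    for _ in range(n_iter):
        grad = (codes @ dictionary - frames) @ dictionary.T
        z = codes - grad / lipschitz
        # Soft-thresholding drives small coefficients to exactly zero.
        codes = np.sign(z) * np.maximum(np.abs(z) - lam / lipschitz, 0.0)
    return codes


def sc_boaw(frames, dictionary, lam=0.1):
    """SC-based BoAW: sparse-code every frame, then max-pool the
    absolute coefficients over the shot (pooling is an assumption)."""
    return np.abs(sparse_codes(frames, dictionary, lam)).max(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(200, 13))    # stand-in for 13-dim MFCC frames
    codebook = rng.normal(size=(32, 13))   # stand-in for a learned codebook
    print(vq_boaw(frames, codebook).shape)  # one 32-dim vector per shot
    print(sc_boaw(frames, codebook).shape)
```

Either function yields one fixed-length vector per video shot, which is the form of input a two-class SVM would then be trained on to label shots as violent or non-violent.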
