A novel violent videos classification scheme based on the bag of audio words features [Document Suppressed in IEEE Xplore]

[This paper has been withdrawn by the publisher]. A novel method that identifies violent videos using audio features alone is introduced. Most previous content-based image or video classification schemes apply the bag of words (BOW) or bag of visual words (BOVW) model, which employs multiple visual features to characterize image or video content. In our method, a bag of audio words (BOAW) is instead built from effective audio features, for two reasons. First, audio features are expected to be especially significant for violent videos. Second, the computational complexity of processing audio features is much lower than that of visual features. MPEG-7 low-level features, such as Audio Spectrum Centroid and Audio Spectrum Spread, and high-level features, such as Audio Signature, are combined into one 44-dimensional vector in the BOAW model. The audio words are built from these vectors by clustering, and a support vector machine (SVM) with a revised soft-weighting scheme classifies the resulting audio-word features into two classes, violent and non-violent. Experiments demonstrate that the proposed method achieves good recall and precision in detecting violent videos. The method can also be applied to classify other types of videos.
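The codebook and soft-weighting steps described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract names neither the clustering algorithm nor the details of the revised soft-weighting scheme, so plain k-means and a generic top-N soft assignment (the weight halves with each rank and is scaled by a distance-based similarity) stand in for them. The function names, the codebook size, and the random 44-dimensional frames are hypothetical, and MPEG-7 feature extraction and the SVM stage are omitted.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(frames, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in for the paper's clustering).

    Returns k centroids, which play the role of the audio words.
    """
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(frames, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for f in frames:
            nearest = min(range(k), key=lambda j: sq_dist(f, centroids[j]))
            clusters[nearest].append(f)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def soft_histogram(frames, centroids, top_n=3):
    """Generic soft-weighted BOAW histogram (not the paper's revised scheme).

    Each frame votes for its top_n nearest audio words; the vote for the
    word at rank r (0-based) is similarity / 2**r.
    """
    hist = [0.0] * len(centroids)
    for f in frames:
        ranked = sorted(range(len(centroids)),
                        key=lambda j: sq_dist(f, centroids[j]))
        for rank, j in enumerate(ranked[:top_n]):
            sim = 1.0 / (1.0 + math.sqrt(sq_dist(f, centroids[j])))
            hist[j] += sim / (2 ** rank)
    total = sum(hist)
    return [h / total for h in hist]  # L1-normalize so clips are comparable


# Toy data: 200 random 44-dimensional frames stand in for real descriptors.
rng = random.Random(1)
frames = [[rng.random() for _ in range(44)] for _ in range(200)]
codebook = kmeans(frames, k=8)
clip_hist = soft_histogram(frames[:50], codebook)  # one clip = 50 frames
```

In a full pipeline, one such histogram per video clip would be the input vector to a two-class SVM (violent versus non-violent), trained with whichever SVM library is at hand.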
