Multi-Perspective Cost-Sensitive Context-Aware Multi-Instance Sparse Coding and Its Application to Sensitive Video Recognition

With the growth of video-sharing websites, P2P networks, microblogs, and mobile WAP sites, sensitive videos have become easier to access, so effective sensitive video recognition is necessary for web content security. Among sensitive web videos, this paper focuses on violent and horror videos. Based on color emotion and color harmony theories, we extract visual emotional features from videos. A video is viewed as a bag, and each shot in the video is represented by a key frame, which is treated as an instance in the bag. We then combine multi-instance learning (MIL) with sparse coding to recognize violent and horror videos. The resulting MIL-based model can be updated online to adapt to changing web environments. We propose a cost-sensitive context-aware multi-instance sparse coding (MI-SC) method, in which the contextual structure of the key frames is modeled using a graph, and audio and visual features are fused by extending classic sparse coding to cost-sensitive sparse coding. We then propose a multi-perspective multi-instance joint sparse coding (MI-J-SC) method that handles each bag of instances from an independent perspective, a contextual perspective, and a holistic perspective. Experiments demonstrate that the emotional features are effective for violent and horror video recognition, and that our cost-sensitive context-aware MI-SC and multi-perspective MI-J-SC methods outperform traditional MIL methods as well as traditional SVM- and KNN-based methods.
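
The bag-and-instance formulation can be illustrated with a minimal sketch. It is not the MI-SC or MI-J-SC method itself: it omits the context graph and the joint multi-perspective coding, and it assumes a simple stand-in in which each key-frame feature is sparse-coded over a labeled dictionary by ISTA, per-dimension cost weights `w` balance the visual and audio parts of the feature, and a bag receives the label of the class whose atoms give the smallest summed weighted residual. The function names, the weights, and the toy data are all hypothetical.

```python
import numpy as np

def ista_weighted(D, x, w, lam=0.1, n_iter=200):
    """Minimise 0.5 * sum_j w_j * (x - D a)_j^2 + lam * ||a||_1 by ISTA.

    D: (d, k) dictionary, x: (d,) instance feature,
    w: (d,) nonnegative per-dimension costs (an assumed form of cost sensitivity).
    """
    sw = np.sqrt(w)
    Dw = D * sw[:, None]                      # rows scaled by sqrt(cost)
    xw = x * sw
    L = np.linalg.norm(Dw, 2) ** 2 + 1e-12    # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - Dw.T @ (Dw @ a - xw) / L      # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a

def classify_bag(bag, D, atom_labels, w, lam=0.1):
    """Label a bag (video) by the class with the smallest summed weighted residual.

    Summing residuals over instances is an assumed bag-level rule,
    chosen for simplicity.
    """
    classes = np.unique(atom_labels)
    residual = {int(c): 0.0 for c in classes}
    for x in bag:                              # one instance per key frame
        a = ista_weighted(D, x, w, lam)
        for c in classes:
            a_c = np.where(atom_labels == c, a, 0.0)   # keep class-c atoms only
            r = np.sqrt(w) * (x - D @ a_c)
            residual[int(c)] += float(r @ r)
    return min(residual, key=residual.get)

def demo():
    """Toy data: 4 'visual' + 2 'audio' dimensions, two classes (0 and 1)."""
    rng = np.random.default_rng(0)
    m0 = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
    m1 = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 1.0])
    atoms = [m + 0.05 * rng.standard_normal(6) for m in (m0,) * 5 + (m1,) * 5]
    D = np.stack(atoms, axis=1)
    D /= np.linalg.norm(D, axis=0)             # unit-norm atoms
    atom_labels = np.array([0] * 5 + [1] * 5)
    w = np.array([1.0, 1.0, 1.0, 1.0, 0.5, 0.5])   # down-weight audio dimensions
    bag = [m1 + 0.05 * rng.standard_normal(6) for _ in range(5)]
    return classify_bag(bag, D, atom_labels, w, lam=0.05)
```

Calling `demo()` builds a two-class toy dictionary and classifies a bag of five noisy instances drawn near the second class prototype. In practice the weights `w` and the sparsity parameter `lam` would have to be chosen by validation.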
