Complex event detection via attention-based video representation and classification

As an important task in managing unconstrained web videos, multimedia event detection (MED) has attracted wide attention recently. However, due to the complexities such as high abstraction of the events, various scenes and frequent interactions of individuals etc., MED is quite challenging. In this paper, we propose a novel MED algorithm via attention-based video representation and classification. Firstly, inspired by human’s selective attention mechanism, an attention-based saliency localization network (ASLN) is constructed to quickly predict the semantic saliency objects of video frames. Afterwards, in order to complementarily represent salient objects and the surroundings, two Convolutional Neural Networks (CNNs) features, i.e., local saliency feature and global feature are respectively extracted from the salient objects and the whole feature map. Thirdly, after binding two features together, Vector of Locally Aggregated Descriptors (VLAD) is applied to encode them into the video representation. Finally, the linear Support Vector Machine (SVM) classifiers are trained to classify. We extensively evaluate the performance on TRECVID MED14_10Ex, MED14_100Ex and Columbia Consume Video (CCV) datasets. Experimental results show that the proposed single model outperforms state-of-the-art approaches on all three real-world video datasets, and demonstrate the effectiveness.

[1]  Vasileios Mezaris,et al.  Video event detection using generalized subclass discriminant analysis and linear support vector machines , 2014, ICMR.

[2]  Shih-Fu Chang,et al.  Consumer video understanding: a benchmark database and an evaluation of human and machine performance , 2011, ICMR.

[3]  Limin Wang,et al.  Action recognition with trajectory-pooled deep-convolutional descriptors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Maryanne Martin Local and global processing: The role of sparsity , 1979 .

[5]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[6]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[7]  Yi Yang,et al.  They are Not Equally Reliable: Semantic Event Search Using Differentiated Concept Classifiers , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Hui Cheng,et al.  Semantic pooling for complex event detection , 2013, MM '13.

[9]  Andrea Vedaldi,et al.  Vlfeat: an open and portable library of computer vision algorithms , 2010, ACM Multimedia.

[10]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[11]  Dong Liu,et al.  Joint audio-visual bi-modal codewords for video event detection , 2012, ICMR.

[12]  Yi Yang,et al.  Searching Persuasively: Joint Event Detection and Evidence Recounting with Limited Supervision , 2015, ACM Multimedia.

[13]  Yi Yang,et al.  Bi-Level Semantic Representation Analysis for Multimedia Event Detection , 2017, IEEE Transactions on Cybernetics.

[14]  Yi Yang,et al.  Semantic Pooling for Complex Event Analysis in Untrimmed Videos , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Nikos Komodakis,et al.  Object Detection via a Multi-region and Semantic Segmentation-Aware CNN Model , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[18]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[20]  Nuno Vasconcelos,et al.  Dynamic Pooling for Complex Event Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[21]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  B. S. Manjunath,et al.  Eye tracking assisted extraction of attentionally important objects from videos , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  G. Miller Learning to Forget , 2004, Science.

[24]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Fei Su,et al.  Specific video identification via joint learning of latent semantic concept, scene and temporal structure , 2016, Neurocomputing.

[26]  Xi Wang,et al.  Evaluating Two-Stream CNN for Video Classification , 2015, ICMR.

[27]  Fei-Fei Li,et al.  Learning latent temporal structure for complex event detection , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Yi Yang,et al.  Complex Event Detection using Semantic Saliency and Nearly-Isotonic SVM , 2015, ICML.

[29]  Jürgen Schmidhuber,et al.  Learning to forget: continual prediction with LSTM , 1999 .

[30]  Andrew Zisserman,et al.  All About VLAD , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Nicu Sebe,et al.  The Mystery of Faces: Investigating Face Contribution for Multimedia Event Detection , 2014, ICMR.

[32]  Dennis Koelma,et al.  The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection , 2016, ICMR.

[33]  Dong Liu,et al.  BBNVISER : BBN VISER TRECVID 2012 Multimedia Event Detection and Multimedia Event Recounting Systems , 2012, TRECVID.

[34]  Nicu Sebe,et al.  Knowledge Adaptation with PartiallyShared Features for Event DetectionUsing Few Exemplars , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Chengqi Zhang,et al.  Dynamic Concept Composition for Zero-Example Event Detection , 2016, AAAI.

[36]  Teruko Mitamura,et al.  Multimodal knowledge-based analysis in multimedia event detection , 2012, ICMR '12.

[37]  C. Koch Strategies and Models of Selective Attention , 2010 .

[38]  Yi Li,et al.  R-FCN: Object Detection via Region-based Fully Convolutional Networks , 2016, NIPS.

[39]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[41]  Yi Yang,et al.  A discriminative CNN video representation for event detection , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  J. R. Pomerantz Global and local precedence: selective attention in form and motion perception. , 1983, Journal of experimental psychology. General.

[43]  Dong Liu,et al.  Recognizing Complex Events in Videos by Learning Key Static-Dynamic Evidences , 2014, ECCV.

[44]  Stefan Treue,et al.  Feature-based attention influences motion processing gain in macaque visual cortex , 1999, Nature.