Deep visual-semantic for crowded video understanding

Visual-semantic features play a vital role in crowded video understanding. Convolutional Neural Networks (CNNs) have achieved a significant breakthrough in learning representations from images. However, learning visual-semantic features, and extracting them effectively for video analysis, remains a challenging task. In this study, we propose a novel visual-semantic method that captures both appearance and dynamic representations. In particular, we propose a spatial context method based on fractional Fisher vector (FV) encoding of CNN features, which is our main contribution. In addition, to capture temporal context information, we apply the fractional encoding method to dynamic images. Experimental results on the WWW crowded video dataset demonstrate that the proposed method outperforms the state of the art.
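
To make the encoding step concrete, the following is a minimal sketch of standard improved Fisher vector encoding applied to a set of local descriptors, such as the spatial positions of a convolutional feature map. It is not the fractional variant proposed here, whose definition the abstract does not give. The function name `fisher_vector`, the exponent parameter `alpha`, the toy dimensions, and the randomly drawn GMM parameters are all assumptions made for illustration; in practice the GMM would be fitted to training descriptors.

```python
import numpy as np


def fisher_vector(x, weights, means, sigmas, alpha=0.5):
    """Encode local descriptors x (T, D) against a diagonal-covariance GMM.

    weights: (K,) mixture weights, means: (K, D), sigmas: (K, D) standard
    deviations. Returns a vector of length 2 * K * D.
    `alpha` is the power-normalization exponent (hypothetical parameter name).
    """
    T = x.shape[0]
    # Normalized deviation of every descriptor from every component: (T, K, D)
    diff = (x[:, None, :] - means[None, :, :]) / sigmas[None, :, :]

    # Log of the weighted Gaussian densities (constant term dropped because
    # it cancels), followed by a numerically stable softmax for posteriors.
    log_prob = (
        np.log(weights)[None, :]
        - 0.5 * np.sum(diff ** 2, axis=2)
        - np.sum(np.log(sigmas), axis=1)[None, :]
    )
    log_prob -= log_prob.max(axis=1, keepdims=True)
    gamma = np.exp(log_prob)
    gamma /= gamma.sum(axis=1, keepdims=True)  # (T, K)

    # Gradient statistics with respect to the means and standard deviations
    g_mu = np.einsum("tk,tkd->kd", gamma, diff) / (T * np.sqrt(weights)[:, None])
    g_sigma = np.einsum("tk,tkd->kd", gamma, diff ** 2 - 1.0) / (
        T * np.sqrt(2.0 * weights)[:, None]
    )

    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    fv = np.sign(fv) * np.abs(fv) ** alpha  # signed power normalization
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv    # L2 normalization


# Toy usage: 49 descriptors (e.g. a flattened 7x7 feature map) of dimension 8,
# encoded against a 4-component GMM with made-up parameters.
rng = np.random.default_rng(0)
K, D = 4, 8
descriptors = rng.normal(size=(49, D))
weights = np.full(K, 1.0 / K)
means = rng.normal(size=(K, D))
sigmas = np.ones((K, D))
fv = fisher_vector(descriptors, weights, means, sigmas)
print(fv.shape)  # (64,)
```

The same function could in principle be applied to descriptors extracted from dynamic images to obtain a temporal representation, although how the two encodings are combined is not specified in the abstract.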
