Saliency-guided video classification via adaptively weighted learning

Video classification is useful in many practical applications, and recent deep learning methods have greatly improved its accuracy. However, existing works often model video frames indiscriminately, even though, from the perspective of motion, frames decompose naturally into salient and non-salient areas. These two kinds of areas call for different networks: salient areas carry both appearance and motion information, whereas non-salient areas carry static background information. To address this, we first predict video saliency from optical flow, without supervision. We then train two 3D CNN streams individually on the salient areas, one for raw frames and one for optical flow, and a third 2D CNN stream on raw frames of the non-salient areas. Because these three streams contribute differently to each class, the weight of each stream is learned adaptively per class. Experimental results show that saliency-guided modeling and adaptively weighted learning reinforce each other, and we achieve state-of-the-art results.
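To make the pipeline concrete, here is a minimal sketch of the three-stream, saliency-guided classifier described above. The tiny stream backbones, the flow-magnitude threshold, and all names are illustrative assumptions, not the paper's exact networks or training procedure; the per-class fusion weights follow the adaptive weighting idea by learning one weight per (stream, class) pair, normalized across streams.

```python
# Hedged sketch of the method in the abstract. Backbones are stand-ins;
# the threshold and module names are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

def saliency_mask(flow, thresh=1.0):
    """Unsupervised saliency from optical flow: a pixel is salient if its
    flow magnitude exceeds a threshold. flow: (B, 2, T, H, W)."""
    mag = torch.linalg.vector_norm(flow, dim=1, keepdim=True)  # (B,1,T,H,W)
    return (mag > thresh).float()

class ThreeStreamClassifier(nn.Module):
    def __init__(self, num_classes, feat_dim=256):
        super().__init__()
        # Stand-ins for the two 3D CNNs (salient RGB, salient flow) and
        # the 2D CNN (non-salient background); real backbones are deeper.
        self.rgb3d = nn.Sequential(nn.Conv3d(3, feat_dim, 3, padding=1),
                                   nn.ReLU(), nn.AdaptiveAvgPool3d(1))
        self.flow3d = nn.Sequential(nn.Conv3d(2, feat_dim, 3, padding=1),
                                    nn.ReLU(), nn.AdaptiveAvgPool3d(1))
        self.bg2d = nn.Sequential(nn.Conv2d(3, feat_dim, 3, padding=1),
                                  nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.heads = nn.ModuleList(nn.Linear(feat_dim, num_classes)
                                   for _ in range(3))
        # Adaptive fusion: one learnable weight per (stream, class) pair,
        # softmax-normalized across streams for each class.
        self.fusion = nn.Parameter(torch.zeros(3, num_classes))

    def forward(self, frames, flow):
        mask = saliency_mask(flow)                      # (B,1,T,H,W)
        salient = frames * mask                         # salient areas
        background = (frames * (1 - mask)).mean(dim=2)  # avg bg (B,3,H,W)
        feats = [self.rgb3d(salient).flatten(1),
                 self.flow3d(flow * mask).flatten(1),
                 self.bg2d(background).flatten(1)]
        scores = torch.stack([h(f) for h, f in zip(self.heads, feats)])
        w = F.softmax(self.fusion, dim=0).unsqueeze(1)  # (3,1,C)
        return (w * scores).sum(dim=0)                  # fused (B,C) logits

model = ThreeStreamClassifier(num_classes=101)
frames = torch.randn(2, 3, 8, 32, 32)  # batch of 8-frame RGB clips
flow = torch.randn(2, 2, 8, 32, 32)    # corresponding optical flow
logits = model(frames, flow)           # (2, 101)
```

In this reading, each stream produces its own class scores, and the softmax over the fusion matrix lets a class lean on whichever streams are most informative for it, which is one plausible realization of the per-class adaptive weighting the abstract describes.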