Scene Recognition in Short Video with Multi-Resolution CNNs

To address scene recognition in short videos, this paper proposes a deep fusion network based on VGGNet. First, VGGNet16 is used to learn global features, and VGGNet19 is used to learn image details. The learned features are then fused by weighted averaging. On the public 2017-AI-Challenger-scene-classification dataset, the method achieves a top-3 accuracy of 92.2%, and on the Charades short-video dataset it achieves a top-3 accuracy of 78.9%, indicating that the proposed method performs well in scene recognition.
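The fusion and evaluation steps described in the abstract can be illustrated with a minimal sketch. It assumes the fusion is applied to per-class probability vectors produced by the two networks. The abstract does not say at which layer the fusion happens or which weights are used, so the function names, the `weight_a` parameter, and its default of 0.5 are all hypothetical.

```python
import numpy as np


def softmax(logits):
    """Convert raw class scores of shape (N, C) into probabilities."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def fuse_weighted_average(probs_a, probs_b, weight_a=0.5):
    """Weighted average of two (N, C) probability arrays.

    Assumption: probs_a and probs_b stand in for the outputs of the two
    VGG branches; weight_a is a hypothetical mixing weight, not a value
    reported in the paper.
    """
    if probs_a.shape != probs_b.shape:
        raise ValueError("both branches must predict the same classes")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    return weight_a * probs_a + (1.0 - weight_a) * probs_b


def top_k_accuracy(probs, labels, k=3):
    """Fraction of samples whose true label is among the k highest scores."""
    top_k = np.argsort(probs, axis=1)[:, -k:]        # indices of the k largest
    hits = (top_k == labels[:, None]).any(axis=1)    # true label among them?
    return float(hits.mean())
```

In practice the mixing weight would typically be chosen on a validation set; a convex combination keeps each fused row a valid probability distribution, so the top-3 metric can be computed on it directly.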
