Learning Semantic Feature Map for Visual Content Recognition

The spatial relationships among objects provide rich clues to object contexts for visual recognition. In this paper, we propose to learn a Semantic Feature Map (SFM) with deep neural networks to model spatial object contexts for better understanding of image and video content. Specifically, we first extract high-level semantic object features from the input image with convolutional neural networks for every object proposal, and organize them into the designed SFM so that the spatial information among objects is preserved. To fully exploit the spatial relationships among objects, we employ either Fully Convolutional Networks (FCN) or Long Short-Term Memory (LSTM) on top of the SFM for final recognition. For better training, we also introduce a multi-task learning framework to train the model in an end-to-end manner. Its objective is composed of an overall image classification loss and a grid labeling loss, the latter of which predicts the object label at each SFM grid cell. Extensive experiments are conducted to verify the effectiveness of the proposed approach. For image classification, very promising results are obtained on the Pascal VOC 2007/2012 and MS-COCO benchmarks. We also directly transfer the SFM learned in the image domain to the video classification task. The results on the CCV benchmark demonstrate the robustness and generalization capability of the proposed approach.
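As a rough illustration of how per-proposal features might be organized into a spatial grid, the following minimal sketch places each proposal's feature vector into the grid cell containing the proposal's center and merges collisions with an element-wise maximum. The abstract does not specify the assignment or pooling rule, so the function name `build_sfm`, the center-based assignment, the max merge, and the grid size are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def build_sfm(boxes, feats, image_size, grid=(4, 4)):
    """Arrange per-proposal features on a spatial grid (illustrative only).

    boxes:      (N, 4) array-like of [x1, y1, x2, y2] in pixels
    feats:      (N, D) array-like of per-proposal feature vectors
    image_size: (width, height) of the input image
    grid:       (rows, cols) of the map; the size is an assumed hyperparameter
    """
    boxes = np.asarray(boxes, dtype=float)
    feats = np.asarray(feats, dtype=float)
    width, height = image_size
    rows, cols = grid

    sfm = np.zeros((rows, cols, feats.shape[1]))
    filled = np.zeros((rows, cols), dtype=bool)

    # Assumption: a proposal belongs to the cell that contains its center.
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0
    col_idx = np.clip((cx / width * cols).astype(int), 0, cols - 1)
    row_idx = np.clip((cy / height * rows).astype(int), 0, rows - 1)

    for r, c, f in zip(row_idx, col_idx, feats):
        # Assumption: proposals sharing a cell are merged by element-wise max.
        sfm[r, c] = np.maximum(sfm[r, c], f) if filled[r, c] else f
        filled[r, c] = True
    return sfm

# Toy usage: three proposals with 2-D features on a 2x2 grid.
boxes = [[0, 0, 10, 10], [90, 90, 100, 100], [0, 0, 20, 20]]
feats = [[1.0, 0.0], [0.0, 2.0], [0.5, 3.0]]
sfm = build_sfm(boxes, feats, image_size=(100, 100), grid=(2, 2))
```

The resulting array has shape (rows, cols, D), so a convolutional or recurrent model can consume it while the relative positions of the objects are retained. Cells that receive no proposal stay at zero in this sketch.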
