Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks

It is well known that contextual and multi-scale representations are important for accurate visual recognition. In this paper we present the Inside-Outside Net (ION), an object detector that exploits information both inside and outside the region of interest. Contextual information outside the region of interest is integrated using spatial recurrent neural networks. Inside, we use skip pooling to extract information at multiple scales and levels of abstraction. Through extensive experiments we evaluate the design space and provide readers with an overview of what tricks of the trade are important. ION improves state-of-the-art on PASCAL VOC 2012 object detection from 73.9% to 77.9% mAP. On the new and more challenging MS COCO dataset, we improve state-of-the-art from 19.7% to 33.1% mAP. In the 2015 MS COCO Detection Challenge, our ION model won "Best Student Entry" and finished 3rd place overall. As intuition suggests, our detection results provide strong evidence that context and multi-scale representations improve small object detection.

[1]  R. Vaillant,et al.  An original approach for the localization of objects in images , 1993 .

[2]  R. Vaillant,et al.  Original approach for the localisation of objects in images , 1994 .

[3]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[4]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[5]  Sven Behnke,et al.  Hierarchical Neural Networks for Image Interpretation , 2003, Lecture Notes in Computer Science.

[6]  J. Koenderink,et al.  Representation of local geometry in the visual system , 1987, Biological Cybernetics.

[7]  Antonio Torralba,et al.  Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[8]  Antonio Torralba,et al.  Contextual Priming for Object Detection , 2003, International Journal of Computer Vision.

[9]  T. Munich,et al.  Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks , 2008, NIPS.

[10]  Alexei A. Efros,et al.  An empirical study of context in object detection , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[12]  Subhransu Maji,et al.  Semantic contours from inverse detectors , 2011, 2011 International Conference on Computer Vision.

[13]  Tsuhan Chen,et al.  Exploring Tiny Images: The Roles of Appearance and Contextual Information for Machine and Human Object Recognition , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Derek Hoiem,et al.  Diagnosing Error in Object Detectors , 2012, ECCV.

[15]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[16]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[17]  Yann LeCun,et al.  Pedestrian Detection with Unsupervised Multi-stage Feature Learning , 2012, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Andrew Zisserman,et al.  Return of the Devil in the Details: Delving Deep into Convolutional Nets , 2014, BMVC.

[19]  Yoshua Bengio,et al.  On the Properties of Neural Machine Translation: Encoder–Decoder Approaches , 2014, SSST@EMNLP.

[20]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  R. Fergus,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.

[22]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[23]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[24]  Jonathan T. Barron,et al.  Multiscale Combinatorial Grouping , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  C. Lawrence Zitnick,et al.  Edge Boxes: Locating Object Proposals from Edges , 2014, ECCV.

[26]  Jitendra Malik,et al.  Analyzing the Performance of Multilayer Neural Networks for Object Recognition , 2014, ECCV.

[27]  Wei Liu,et al.  ParseNet: Looking Wider to See Better , 2015, ArXiv.

[28]  Fei-Fei Li,et al.  Visualizing and Understanding Recurrent Networks , 2015, ArXiv.

[29]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Jitendra Malik,et al.  Hypercolumns for object segmentation and fine-grained localization , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Ronan Collobert,et al.  Learning to Segment Object Candidates , 2015, NIPS.

[32]  Yoshua Bengio,et al.  ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks , 2015, ArXiv.

[33]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[34]  Marcus Liwicki,et al.  Scene labeling with LSTM recurrent neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Geoffrey E. Hinton,et al.  A Simple Way to Initialize Recurrent Networks of Rectified Linear Units , 2015, ArXiv.

[37]  Nikos Komodakis,et al.  Object Detection via a Multi-region and Semantic Segmentation-Aware CNN Model , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[38]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[39]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[40]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Bernt Schiele,et al.  What Makes for Effective Detection Proposals? , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[42]  Fuchun Sun,et al.  HyperNet: Towards Accurate Region Proposal Generation and Joint Object Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  Jonathan T. Barron,et al.  Multiscale Combinatorial Grouping for Image Segmentation and Object Proposal Generation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.