Learning scene gist with convolutional neural networks to improve object recognition

Advances in convolutional neural networks (CNNs) have led to high performance on multiple object recognition tasks. While some approaches use information from the entire scene to propose regions of interest, the task of interpreting a particular region or object is still performed independently of other objects and features in the image. Here we demonstrate that a scene's ‘gist’ can significantly contribute to how well humans recognize objects. These findings are consistent with the notion that humans foveate on an object and incorporate information from the periphery to aid in recognition. We use a biologically inspired two-part convolutional neural network ('GistNet') that models the fovea and periphery to provide a proof-of-principle demonstration that computational object recognition can significantly benefit from the gist of the scene as contextual information. Our model yields accuracy improvements of up to 50% in certain object categories when incorporating contextual gist, while increasing the original model size by only 5%. This proposed model mirrors our intuition about how the human visual system recognizes objects, suggesting specific biologically plausible constraints to improve machine vision and taking initial steps toward the challenge of scene understanding.
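
To make the two-pathway idea concrete, the sketch below shows one way a foveal stream and a low-resolution peripheral "gist" stream can be fused before classification. This is a minimal illustration of the general architecture described above, not the authors' exact GistNet: the layer sizes, input resolutions, and fusion by feature concatenation are all assumed for the example.

```python
# Minimal sketch of a fovea + periphery ("gist") two-stream CNN.
# Hypothetical layer sizes and input resolutions; not the authors'
# exact GistNet architecture.
import torch
import torch.nn as nn


class GistNetSketch(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Foveal pathway: high-resolution central crop of the object.
        self.fovea = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),  # -> 64 x 4 x 4 feature map
        )
        # Peripheral pathway: low-resolution view of the whole scene.
        # Deliberately small, so it adds little to overall model size.
        self.periphery = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(2),  # -> 16 x 2 x 2 feature map
        )
        # Classifier sees concatenated features, so the object decision
        # is conditioned on scene-level contextual information.
        self.classifier = nn.Linear(64 * 4 * 4 + 16 * 2 * 2, num_classes)

    def forward(self, fovea_crop: torch.Tensor,
                scene: torch.Tensor) -> torch.Tensor:
        f = self.fovea(fovea_crop).flatten(1)
        g = self.periphery(scene).flatten(1)
        return self.classifier(torch.cat([f, g], dim=1))


# Usage: a 64x64 foveal crop plus a 32x32 downsampled scene image.
model = GistNetSketch(num_classes=10)
crop = torch.randn(1, 3, 64, 64)
scene = torch.randn(1, 3, 32, 32)
logits = model(crop, scene)  # shape: (1, 10)
```

Because the peripheral stream operates at low resolution with few channels, it contributes only a small fraction of the parameters, which is consistent with the abstract's point that contextual gist can be added at modest cost to model size.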
