Object detection through search with a foveated visual system

Humans and many other species sense visual information with varying spatial resolution across the visual field (foveated vision) and deploy eye movements to actively sample regions of interest in scenes. The advantage of such a varying-resolution architecture is a reduced computational, and hence metabolic, cost. But what are the performance costs of such a processing strategy relative to a scheme that processes the entire visual field at high spatial resolution? Here we first focus on visual search and combine object detectors from computer vision with a recent model of peripheral pooling regions found at the V1 layer of the human visual system. We develop a foveated object detector that processes the entire scene with varying resolution, uses retino-specific object detection classifiers to guide eye movements, aligns its fovea with regions of interest in the input image, and integrates observations across multiple fixations. We compared the foveated object detector against a non-foveated version of the same object detector, which processes the entire image at homogeneous high spatial resolution. We evaluated the accuracy of the foveated and non-foveated object detectors in identifying 20 different object classes in scenes from a standard computer vision data set (the PASCAL VOC 2007 dataset). We show that the foveated object detector can approximate the performance of the object detector with homogeneous high spatial resolution processing while bringing significant computational cost savings. Additionally, we assessed the impact of foveation on the computation of bottom-up saliency. An implementation of a simple foveated bottom-up saliency model with eye movements agreed with a non-foveated high-resolution saliency model in its selection of the top salient regions of scenes.
Together, our results might help explain the evolution of foveated visual systems with eye movements as a solution that preserves perceptual performance in visual search while resulting in computational and metabolic savings to the brain.
