Feature-Based Resource Allocation for Real-Time Stereo Disparity Estimation

The most accurate stereo disparity algorithms take dozens or hundreds of seconds to process a single frame, which is impractical for many applications. However, high accuracy is often not needed throughout the scene. Here, we investigate a “foveation” approach, in which some parts of an image are processed more intensively than others, in the context of modern stereo algorithms. We consider two scenarios: disparity estimation with a convolutional network in a robotic grasping context, and disparity estimation with a Markov random field in a navigation context. In each case, combining fast and slow methods in different parts of the scene improves frame rates while maintaining accuracy in the most task-relevant areas. We also demonstrate a simple and broadly applicable utility function for choosing foveal regions, which combines image and task information. Finally, we examine whether multiple individually placed small foveae per image outperform a single large fovea, and find little benefit, supporting the use of hardware foveae of fixed size and shape. More generally, our results reaffirm that foveation is a practical way to combine speed with task-relevant accuracy. Foveae are present in the most complex biological vision systems, suggesting that they may become more important in artificial vision systems as these systems become more complex.
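To make the pipeline concrete, the sketch below shows one way the fast/slow split and the utility-driven fovea placement could fit together. It is a minimal illustration under stated assumptions, not the paper's implementation: OpenCV's StereoBM stands in for the fast method and StereoSGBM for the slow, accurate one (the paper uses a convolutional network and a Markov random field), and the utility map's weighted sum of edge density (image term) and a caller-supplied task prior is one assumed form of the image-and-task combination. The helper names (`utility_map`, `choose_fovea`, `foveated_disparity`) and the parameter values are hypothetical.

```python
# A minimal sketch (not the authors' implementation) of a foveated stereo
# pipeline: a fast matcher runs over the whole frame, and a slower, more
# accurate matcher refines only a fovea chosen by a utility map.
import cv2
import numpy as np

NUM_DISP = 64  # disparity search range; must be divisible by 16 for OpenCV


def utility_map(left_gray, task_prior, alpha=0.5):
    """Combine an image term with a task term, both roughly in [0, 1].

    Edge density stands in for "image information" (textured regions where
    refinement pays off); task_prior is a caller-supplied map, e.g. high
    near a grasp target. The weighted-sum form and alpha are assumptions.
    """
    gx = cv2.Sobel(left_gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(left_gray, cv2.CV_32F, 0, 1)
    edges = cv2.magnitude(gx, gy)
    edges /= edges.max() + 1e-6
    return alpha * edges + (1.0 - alpha) * task_prior.astype(np.float32)


def choose_fovea(utility, fovea_hw):
    """Place one fixed-size fovea at the window of highest mean utility."""
    h, w = fovea_hw
    mean_u = cv2.boxFilter(utility, -1, (w, h))  # mean over each h-by-w window
    cy, cx = np.unravel_index(np.argmax(mean_u), mean_u.shape)
    y0 = int(np.clip(cy - h // 2, 0, utility.shape[0] - h))
    x0 = int(np.clip(cx - w // 2, 0, utility.shape[1] - w))
    return y0, x0


def foveated_disparity(left_gray, right_gray, task_prior, fovea_hw=(128, 256)):
    """left_gray/right_gray: rectified 8-bit grayscale stereo pair."""
    # Fast pass everywhere: block matching (stand-in for the fast method).
    fast = cv2.StereoBM_create(numDisparities=NUM_DISP, blockSize=15)
    disp = fast.compute(left_gray, right_gray).astype(np.float32) / 16.0

    # Slow pass inside the fovea only: semi-global matching stands in for
    # the paper's convolutional-network / MRF matchers.
    h, w = fovea_hw
    y0, x0 = choose_fovea(utility_map(left_gray, task_prior), fovea_hw)
    # Widen the crop leftward by the search range so pixels near the fovea's
    # left edge can still find their matches in the right image.
    xl = max(0, x0 - NUM_DISP)
    slow = cv2.StereoSGBM_create(minDisparity=0, numDisparities=NUM_DISP,
                                 blockSize=5, P1=200, P2=800)
    d = slow.compute(left_gray[y0:y0 + h, xl:x0 + w],
                     right_gray[y0:y0 + h, xl:x0 + w]).astype(np.float32) / 16.0
    disp[y0:y0 + h, x0:x0 + w] = d[:, x0 - xl:]
    return disp, (y0, x0, h, w)
```

Note the left padding of the fovea crop: a horizontal-disparity matcher looks up to `numDisparities` pixels leftward in the right image, so refining a bare crop would leave an invalid band along the fovea's left edge.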
