Learning to Navigate the Energy Landscape

In this paper, we present a novel, general, and efficient architecture for computer vision problems approached from an 'Analysis by Synthesis' standpoint. Analysis by synthesis involves minimizing a reconstruction error that is typically a non-convex function of the latent target variables. State-of-the-art methods adopt a hybrid scheme in which discriminatively trained predictors such as Random Forests or Convolutional Neural Networks initialize local search algorithms. While these hybrid methods have produced promising results, they often get stuck in local optima. Our method goes beyond the conventional hybrid architecture by not only proposing multiple accurate initial solutions but also defining a navigational structure over the solution space that can be used for extremely efficient gradient-free local search. We demonstrate the efficacy and generalizability of our approach on tasks as diverse as Hand Pose Estimation, RGB Camera Relocalization, and Image Retrieval.
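To make the idea of gradient-free search over a navigational structure concrete, the following is a minimal sketch and not the architecture described in the paper. It assumes, purely for illustration, that the solution space is a finite set of candidate points, that the navigational structure is a brute-force k-nearest-neighbor graph, and that the search is a greedy descent started from several seeds. The names `build_knn_graph`, `navigate`, and the toy one-dimensional `energy` are hypothetical.

```python
import numpy as np

def build_knn_graph(points, k):
    """Link every candidate to its k nearest candidates (brute force)."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    return np.argsort(dists, axis=1)[:, :k]

def navigate(energy, points, graph, seeds):
    """Greedy gradient-free descent over the graph, restarted from each seed."""
    best_idx, best_e = None, np.inf
    for seed in seeds:
        cur, cur_e = int(seed), energy(points[seed])
        while True:
            nbrs = graph[cur]
            nbr_e = np.array([energy(points[n]) for n in nbrs])
            j = int(np.argmin(nbr_e))
            if nbr_e[j] >= cur_e:  # no neighbor improves: local optimum
                break
            cur, cur_e = int(nbrs[j]), float(nbr_e[j])
        if cur_e < best_e:
            best_idx, best_e = cur, cur_e
    return best_idx, best_e

def energy(x):
    """Toy non-convex stand-in for a reconstruction error (assumption)."""
    return float(x[0] ** 4 - 3.0 * x[0] ** 2 + x[0])

# Candidate solutions on a 1-D grid; two basins, the left one is deeper.
points = np.linspace(-3.0, 3.0, 61).reshape(-1, 1)
graph = build_knn_graph(points, k=2)

single_idx, single_e = navigate(energy, points, graph, seeds=[60])
multi_idx, multi_e = navigate(energy, points, graph, seeds=[0, 60])
global_e = min(energy(p) for p in points)
```

In this toy setting the single seed on the right stops in the shallower basin, while the two-seed run also reaches the deeper one. This illustrates why multiple initial solutions can help a local search escape poor optima. A real system would take its seeds from a trained predictor rather than from fixed indices.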
