DSAC — Differentiable RANSAC for Camera Localization

RANSAC is an important algorithm in robust optimization and a central building block for many computer vision applications. In recent years, traditionally hand-crafted pipelines have been replaced by deep learning pipelines, which can be trained in an end-to-end fashion. However, RANSAC has so far not been used as part of such deep learning pipelines, because its hypothesis selection procedure is non-differentiable. In this work, we present two different ways to overcome this limitation. The most promising approach is inspired by reinforcement learning, namely to replace the deterministic hypothesis selection by a probabilistic selection for which we can derive the expected loss w.r.t. to all learnable parameters. We call this approach DSAC, the differentiable counterpart of RANSAC. We apply DSAC to the problem of camera localization, where deep learning has so far failed to improve on traditional approaches. We demonstrate that by directly minimizing the expected loss of the output camera poses, robustly estimated by RANSAC, we achieve an increase in accuracy. In the future, any deep learning pipeline can use DSAC as a robust optimization component.

[1]  Jiri Matas,et al.  Locally Optimized RANSAC , 2003, DAGM-Symposium.

[2]  Mingrui Wu,et al.  Gradient descent optimization of smoothed information retrieval metrics , 2010, Information Retrieval.

[3]  Andrew W. Fitzgibbon,et al.  Exploiting uncertainty in regression forests for accurate camera relocalization , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[5]  Roberto Cipolla,et al.  PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[6]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[7]  Tomasz Malisiewicz,et al.  Deep Image Homography Estimation , 2016, ArXiv.

[8]  J. M. M. Montiel,et al.  ORB-SLAM: A Versatile and Accurate Monocular SLAM System , 2015, IEEE Transactions on Robotics.

[9]  Jochen Trumpf,et al.  L1 rotation averaging using the Weiszfeld algorithm , 2011, CVPR 2011.

[10]  Jiri Matas,et al.  Matching with PROSAC - progressive sample consensus , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[11]  Bernhard P. Wrobel,et al.  Multiple View Geometry in Computer Vision , 2001 .

[12]  Andrew W. Fitzgibbon,et al.  Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Eric Brachmann,et al.  Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Vincent Lepetit,et al.  LIFT: Learned Invariant Feature Transform , 2016, ECCV.

[15]  Frederick R. Forst,et al.  On robust estimation of the location parameter , 1980 .

[16]  Rob Fergus,et al.  Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Pieter Abbeel,et al.  Gradient Estimation Using Stochastic Computation Graphs , 2015, NIPS.

[18]  J. Tukey,et al.  The Fitting of Power Series, Meaning Polynomials, Illustrated on Band-Spectroscopic Data , 1974 .

[19]  Eric Brachmann,et al.  Learning Analysis-by-Synthesis for 6D Pose Estimation in RGB-D Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[20]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[21]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Eric Brachmann,et al.  Learning 6D Object Pose Estimation Using 3D Object Coordinates , 2014, ECCV.

[23]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Andrew Zisserman,et al.  MLESAC: A New Robust Estimator with Application to Estimating Image Geometry , 2000, Comput. Vis. Image Underst..

[25]  Robert C. Bolles,et al.  Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography , 1981, CACM.

[26]  Jianliang Tang,et al.  Complete Solution Classification for the Perspective-Three-Point Problem , 2003, IEEE Trans. Pattern Anal. Mach. Intell..

[27]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[28]  Andrew W. Fitzgibbon,et al.  Multi-output Learning for Camera Relocalization , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Tomás Pajdla,et al.  NetVLAD: CNN Architecture for Weakly Supervised Place Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  David W. Jacobs,et al.  WarpNet: Weakly Supervised Matching for Single-View Reconstruction , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Amnon Shashua,et al.  On the Sample Complexity of End-to-end Training vs. Semantic Abstraction Training , 2016, ArXiv.

[32]  V. Lepetit,et al.  EPnP: An Accurate O(n) Solution to the PnP Problem , 2009, International Journal of Computer Vision.

[33]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[34]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[35]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[36]  Cordelia Schmid,et al.  DeepMatching: Hierarchical Deformable Dense Matching , 2015, International Journal of Computer Vision.

[37]  David Nistér,et al.  Preemptive RANSAC for live structure and motion estimation , 2005, Machine Vision and Applications.

[38]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Andrea Vedaldi,et al.  Fully-trainable deep matching , 2016, BMVC.

[40]  Jan-Michael Frahm,et al.  A Comparative Analysis of RANSAC Techniques Leading to Adaptive Real-Time Random Sample Consensus , 2008, ECCV.

[41]  Andrew Zisserman,et al.  All About VLAD , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Michael Isard,et al.  Object retrieval with large vocabularies and fast spatial matching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.