Pixel-Perfect Structure-from-Motion with Featuremetric Refinement

Finding local features that are repeatable across multiple views is a cornerstone of sparse 3D reconstruction. The classical image matching paradigm detects keypoints once per image and never revisits them, which can yield poorly localized features and propagate large errors to the final geometry. In this paper, we refine two key steps of structure-from-motion by directly aligning low-level image information from multiple views: we first adjust the initial keypoint locations prior to any geometric estimation, and subsequently refine points and camera poses as a post-processing step. This refinement is robust to large detection noise and appearance changes, as it optimizes a featuremetric error based on dense features predicted by a neural network. This significantly improves the accuracy of camera poses and scene geometry for a wide range of keypoint detectors, challenging viewing conditions, and off-the-shelf deep features. Our system easily scales to large image collections, enabling pixel-perfect crowd-sourced localization at scale. Our code is publicly available at github.com/cvg/pixel-perfect-sfm as an add-on to the popular SfM software COLMAP.
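To make the idea of optimizing a featuremetric error concrete, the following is a minimal sketch, not the implementation from the released code. It assumes a single keypoint, a single dense feature map, and a fixed reference feature vector, and it moves the keypoint by damped Gauss-Newton steps so that the bilinearly interpolated feature matches the reference. The function names (sample_with_gradient, refine_keypoint, make_demo_features) and the synthetic feature map are hypothetical and chosen for illustration only; a real pipeline would use features predicted by a network and jointly optimize many keypoints with a robust cost.

```python
import numpy as np


def sample_with_gradient(feat, p):
    """Bilinearly interpolate a dense feature map of shape (H, W, D) at p = (x, y).

    Returns the interpolated feature (D,) and its Jacobian w.r.t. p, shape (D, 2).
    """
    h, w, _ = feat.shape
    x = float(np.clip(p[0], 0.0, w - 1.0))
    y = float(np.clip(p[1], 0.0, h - 1.0))
    x0 = min(int(np.floor(x)), w - 2)
    y0 = min(int(np.floor(y)), h - 2)
    ax, ay = x - x0, y - y0
    f00, f01 = feat[y0, x0], feat[y0, x0 + 1]
    f10, f11 = feat[y0 + 1, x0], feat[y0 + 1, x0 + 1]
    value = (1 - ay) * ((1 - ax) * f00 + ax * f01) + ay * ((1 - ax) * f10 + ax * f11)
    d_dx = (1 - ay) * (f01 - f00) + ay * (f11 - f10)
    d_dy = (1 - ax) * (f10 - f00) + ax * (f11 - f01)
    return value, np.stack([d_dx, d_dy], axis=1)


def refine_keypoint(feat, f_ref, p_init, num_iters=10, damping=1e-9):
    """Minimize ||feat(p) - f_ref||^2 over the 2D location p with damped Gauss-Newton."""
    p = np.asarray(p_init, dtype=float).copy()
    for _ in range(num_iters):
        value, jac = sample_with_gradient(feat, p)
        residual = value - f_ref
        hessian = jac.T @ jac + damping * np.eye(2)  # Gauss-Newton approximation
        step = -np.linalg.solve(hessian, jac.T @ residual)
        p += step
        if np.linalg.norm(step) < 1e-8:
            break
    return p


def make_demo_features(h=32, w=32):
    """Synthetic smooth 3-channel feature map (a stand-in for network features)."""
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    return np.stack([xs / w, ys / h, xs * ys / (w * h)], axis=-1)


feat = make_demo_features()
p_true = np.array([12.3, 20.7])                 # location the keypoint should reach
f_ref, _ = sample_with_gradient(feat, p_true)   # reference feature, e.g. from another view
p_refined = refine_keypoint(feat, f_ref, p_init=[15.0, 18.0])
print(p_refined)
```

In this toy setting the noisy initial detection is pulled back to the sub-pixel location whose feature agrees with the reference. The same residual, summed over all observations of a track, is the kind of cost that a keypoint adjustment or a featuremetric bundle adjustment would minimize.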
