FlowCam: Training Generalizable 3D Radiance Fields without Camera Poses via Pixel-Aligned Scene Flow

Reconstruction of 3D neural fields from posed images has emerged as a promising method for self-supervised representation learning. The key challenge preventing the deployment of these 3D scene learners on large-scale video data is their dependence on precise camera poses from structure-from-motion, which is prohibitively expensive to run at scale. We propose a method that jointly reconstructs camera poses and 3D neural scene representations online and in a single forward pass. We estimate poses by first lifting frame-to-frame optical flow to 3D scene flow via differentiable rendering, preserving the locality and shift-equivariance of the image-processing backbone. SE(3) camera pose estimation is then performed via a weighted least-squares fit to the scene flow field. This formulation enables us to jointly supervise pose estimation and a generalizable neural scene representation by re-rendering the input video, and thus to train end-to-end and fully self-supervised on real-world video datasets. We demonstrate that our method performs robustly on diverse, real-world video, notably on sequences that are traditionally challenging for optimization-based pose estimation techniques.
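The two-stage pose estimate sketched in the abstract — lifting 2D flow correspondences to 3D point correspondences using rendered depth, then fitting an SE(3) transform by weighted least squares — corresponds to a classic weighted Kabsch/Procrustes solve. The NumPy sketch below is an illustrative assumption, not the paper's actual implementation (function names, the pinhole unprojection, and the per-correspondence weights `w` are all hypothetical; the method itself lifts points via differentiable volume rendering rather than a bare depth map):

```python
import numpy as np

def unproject(uv, depth, K):
    """Lift pixels uv (N, 2) with per-pixel depth (N,) to 3D points (N, 3)
    in the camera frame, assuming a pinhole camera with intrinsics K (3, 3)."""
    uv_h = np.concatenate([uv, np.ones((uv.shape[0], 1))], axis=1)
    rays = uv_h @ np.linalg.inv(K).T      # rays through each pixel at z = 1
    return rays * depth[:, None]          # scale each ray by its rendered depth

def fit_se3_weighted(p, q, w):
    """Weighted least-squares SE(3) fit (weighted Kabsch/Procrustes):
    find R, t minimizing sum_i w_i * ||R @ p_i + t - q_i||^2."""
    w = w / w.sum()
    cp, cq = w @ p, w @ q                 # weighted centroids of both point sets
    # Weighted cross-covariance of the centered point sets
    H = (p - cp).T @ ((q - cq) * w[:, None])
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T   # project onto SO(3), no reflection
    t = cq - R @ cp
    return R, t

def pose_from_flow(uv1, flow, depth1, depth2_at_uv2, K, w):
    """Scene-flow pose estimate: lift both endpoints of each optical-flow
    correspondence to 3D, then solve for the relative camera motion."""
    p = unproject(uv1, depth1, K)
    q = unproject(uv1 + flow, depth2_at_uv2, K)
    return fit_se3_weighted(p, q, w)
```

In this framing, the weights `w` would plausibly down-weight unreliable correspondences (e.g. low rendering or flow confidence), and because the solver is a closed-form SVD it stays differentiable, so the re-rendering loss can supervise both the scene representation and the pose estimate end-to-end.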
