FlowCam: Training Generalizable 3D Radiance Fields without Camera Poses via Pixel-Aligned Scene Flow

Reconstruction of 3D neural fields from posed images has emerged as a promising method for self-supervised representation learning. The key challenge preventing the deployment of these 3D scene learners on large-scale video data is their dependence on precise camera poses from structure-from-motion, which is prohibitively expensive to run at scale. We propose a method that jointly reconstructs camera poses and 3D neural scene representations online and in a single forward pass. We estimate poses by first lifting frame-to-frame optical flow to 3D scene flow via differentiable rendering, preserving the locality and shift-equivariance of the image-processing backbone. SE(3) camera pose estimation is then performed via a weighted least-squares fit to the scene flow field. This formulation enables us to jointly supervise pose estimation and a generalizable neural scene representation by re-rendering the input video, and thus to train end-to-end and fully self-supervised on real-world video datasets. We demonstrate that our method performs robustly on diverse, real-world video, notably on sequences that are traditionally challenging for optimization-based pose estimation techniques.
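The two-stage pose estimate sketched in the abstract — lifting 2D flow correspondences to 3D point correspondences using rendered depth, then fitting an SE(3) transform by weighted least squares — corresponds to a classic weighted Kabsch/Procrustes solve. The NumPy sketch below is an illustrative assumption, not the paper's actual implementation (function names, the pinhole unprojection, and the per-correspondence weights `w` are all hypothetical; the method itself lifts points via differentiable volume rendering rather than a bare depth map):

```python
import numpy as np

def unproject(uv, depth, K):
    """Lift pixels uv (N, 2) with per-pixel depth (N,) to 3D points (N, 3)
    in the camera frame, assuming a pinhole camera with intrinsics K (3, 3)."""
    uv_h = np.concatenate([uv, np.ones((uv.shape[0], 1))], axis=1)
    rays = uv_h @ np.linalg.inv(K).T      # rays through each pixel at z = 1
    return rays * depth[:, None]          # scale each ray by its rendered depth

def fit_se3_weighted(p, q, w):
    """Weighted least-squares SE(3) fit (weighted Kabsch/Procrustes):
    find R, t minimizing sum_i w_i * ||R @ p_i + t - q_i||^2."""
    w = w / w.sum()
    cp, cq = w @ p, w @ q                 # weighted centroids of both point sets
    # Weighted cross-covariance of the centered point sets
    H = (p - cp).T @ ((q - cq) * w[:, None])
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T   # project onto SO(3), no reflection
    t = cq - R @ cp
    return R, t

def pose_from_flow(uv1, flow, depth1, depth2_at_uv2, K, w):
    """Scene-flow pose estimate: lift both endpoints of each optical-flow
    correspondence to 3D, then solve for the relative camera motion."""
    p = unproject(uv1, depth1, K)
    q = unproject(uv1 + flow, depth2_at_uv2, K)
    return fit_se3_weighted(p, q, w)
```

In this framing, the weights `w` would plausibly down-weight unreliable correspondences (e.g. low rendering or flow confidence), and because the solver is a closed-form SVD it stays differentiable, so the re-rendering loss can supervise both the scene representation and the pose estimate end-to-end.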
