VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion

Recent volumetric 3D reconstruction methods can produce very accurate results, with plausible geometry even for unobserved surfaces. However, they face an undesirable trade-off when it comes to multi-view fusion. They can fuse all available view information by global averaging, thus losing fine detail, or they can heuristically cluster views for local fusion, thus restricting their ability to consider all views jointly. Our key insight is that greater detail can be retained without restricting view diversity by learning a view-fusion function conditioned on camera pose and image content. We propose to learn this multi-view fusion using a transformer. To this end, we introduce VoRTX, an end-to-end volumetric 3D reconstruction network using transformers for wide-baseline, multi-view feature fusion. Our model is occlusion-aware, leveraging the transformer architecture to predict an initial, projective scene geometry estimate. This estimate is used to avoid back-projecting image features through surfaces into occluded regions. We train our model on ScanNet and show that it produces better reconstructions than state-of-the-art methods. We also demonstrate generalization without any fine-tuning, outperforming the same state-of-the-art methods on two other datasets, TUM-RGBD and ICL-NUIM.
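The abstract describes fusing back-projected image features at each voxel with a transformer conditioned on camera pose, rather than averaging them globally. The sketch below is a rough PyTorch illustration of such a per-voxel, pose-conditioned fusion module, not the authors' implementation; the class name VoxelViewFusion, the feature and pose dimensions, the masking scheme, and the masked-average readout are all assumptions for the sake of the example.

```python
import torch
import torch.nn as nn


class VoxelViewFusion(nn.Module):
    """Hypothetical per-voxel multi-view fusion via a transformer encoder.

    Treats each view's back-projected feature at a voxel as a token,
    conditions it on a camera-pose encoding, and attends across views
    instead of averaging them globally.
    """

    def __init__(self, feat_dim=64, pose_dim=16, num_heads=4, num_layers=2):
        super().__init__()
        # Mix each image feature with its pose encoding before attention.
        self.proj = nn.Linear(feat_dim + pose_dim, feat_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, view_feats, pose_embed, valid_mask):
        # view_feats: (V, N, feat_dim)  image features back-projected into V voxels from N views
        # pose_embed: (V, N, pose_dim)  encoding of each view's camera pose / viewing ray
        # valid_mask: (V, N) bool       True where the voxel projects inside that view's image
        tokens = self.proj(torch.cat([view_feats, pose_embed], dim=-1))
        # Self-attention across views; invalid views are masked out.
        # Assumes every voxel is observed by at least one view.
        fused = self.encoder(tokens, src_key_padding_mask=~valid_mask)
        # Reduce the attended per-view tokens to a single feature per voxel.
        w = valid_mask.unsqueeze(-1).float()
        return (fused * w).sum(dim=1) / w.sum(dim=1).clamp(min=1e-6)


# Example: 1000 voxels, 8 candidate views each.
if __name__ == "__main__":
    fusion = VoxelViewFusion()
    feats = torch.randn(1000, 8, 64)
    poses = torch.randn(1000, 8, 16)
    mask = torch.rand(1000, 8) > 0.3
    mask[:, 0] = True  # ensure at least one valid view per voxel
    print(fusion(feats, poses, mask).shape)  # torch.Size([1000, 64])
```

In the same spirit, the occlusion-awareness described above could be realized by first predicting a coarse geometry estimate and then zeroing out valid_mask for views whose ray to a voxel is blocked by that estimated surface, so occluded views never contribute to the fused feature.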