Object-Centric Multi-View Aggregation

We present an approach for aggregating a sparse set of views of an object in order to compute a semi-implicit 3D representation in the form of a volumetric feature grid. Key to our approach is an object-centric canonical 3D coordinate system into which views can be lifted, without explicit camera pose estimation, and then combined in a manner that accommodates a variable number of views and is independent of view order. We show that computing a symmetry-aware mapping from pixels to the canonical coordinate system allows us to better propagate information to unseen regions, as well as to robustly overcome pose ambiguities during inference. Our aggregate representation enables us to perform 3D inference tasks like volumetric reconstruction and novel view synthesis, and we use these tasks to demonstrate the benefits of our aggregation approach as compared to implicit or camera-centric alternatives.
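As a rough illustration of the aggregation idea, the sketch below scatters per-pixel features into a shared canonical voxel grid and averages them. It is not the authors' implementation: the function name `aggregate_views`, the mean pooling, the grid resolution, and the choice of `x = 0.5` as a mirror plane are all assumptions made for this example, and the canonical coordinates and features, which a learned network would predict, are random arrays here.

```python
import numpy as np

def aggregate_views(coords_per_view, feats_per_view, grid_size=8, symmetric=False):
    """Average per-pixel features into a canonical voxel grid (hypothetical helper).

    coords_per_view: list of (N_i, 3) arrays of canonical coordinates in [0, 1).
    feats_per_view:  list of (N_i, C) arrays, one feature vector per pixel.
    symmetric:       if True, also deposit each feature at its mirror image
                     across the x = 0.5 plane (an assumed symmetry plane).
    """
    channels = feats_per_view[0].shape[1]
    feat_sum = np.zeros((grid_size ** 3, channels))
    count = np.zeros(grid_size ** 3)

    for coords, feats in zip(coords_per_view, feats_per_view):
        variants = [coords]
        if symmetric:
            mirrored = coords.copy()
            mirrored[:, 0] = 1.0 - mirrored[:, 0]
            variants.append(mirrored)
        for pts in variants:
            # Map continuous canonical coordinates to integer voxel indices.
            idx = np.clip((pts * grid_size).astype(int), 0, grid_size - 1)
            flat = np.ravel_multi_index(tuple(idx.T), (grid_size,) * 3)
            # Unbuffered accumulation, so repeated voxel indices all contribute.
            np.add.at(feat_sum, flat, feats)
            np.add.at(count, flat, 1.0)

    # Mean over all contributions; empty voxels stay at zero.
    grid = feat_sum / np.maximum(count, 1.0)[:, None]
    return grid.reshape(grid_size, grid_size, grid_size, channels)


rng = np.random.default_rng(0)
coords = [rng.random((50, 3)), rng.random((30, 3)), rng.random((40, 3))]
feats = [rng.random((50, 4)), rng.random((30, 4)), rng.random((40, 4))]

grid = aggregate_views(coords, feats)
grid_reordered = aggregate_views(coords[::-1], feats[::-1])
grid_two_views = aggregate_views(coords[:2], feats[:2], symmetric=True)
```

Because every view is lifted into the same object-centric frame, no camera poses enter the computation, and because the pooling is a sum followed by a normalization, the result does not depend on how many views are supplied or in what order. With `symmetric=True`, features observed on one side of the assumed mirror plane also populate the unseen side.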
