Pix2Vox: Context-Aware 3D Reconstruction From Single and Multi-View Images

Recovering the 3D representation of an object from single-view or multi-view RGB images by deep neural networks has attracted increasing attention in the past few years. Several mainstream works (e.g., 3D-R2N2) use recurrent neural networks (RNNs) to fuse multiple feature maps extracted from input images sequentially. However, when given the same set of input images with different orders, RNN-based approaches are unable to produce consistent reconstruction results. Moreover, due to long-term memory loss, RNNs cannot fully exploit input images to refine reconstruction results. To solve these problems, we propose a novel framework for single-view and multi-view 3D reconstruction, named Pix2Vox. By using a well-designed encoder-decoder, it generates a coarse 3D volume from each input image. Then, a context-aware fusion module is introduced to adaptively select high-quality reconstructions for each part (e.g., table legs) from different coarse 3D volumes to obtain a fused 3D volume. Finally, a refiner further refines the fused 3D volume to generate the final output. Experimental results on the ShapeNet and Pix3D benchmarks indicate that the proposed Pix2Vox outperforms state-of-the-arts by a large margin. Furthermore, the proposed method is 24 times faster than 3D-R2N2 in terms of backward inference time. The experiments on ShapeNet unseen 3D categories have shown the superior generalization abilities of our method.

[1]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[2]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[3]  Jitendra Malik,et al.  Shape, Illumination, and Reflectance from Shading , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Yang Zhang,et al.  RealPoint3D: An Efficient Generation Network for 3D Object Reconstruction From a Single Image , 2019, IEEE Access.

[5]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[6]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[7]  Hao Su,et al.  A Point Set Generation Network for 3D Object Reconstruction from a Single Image , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Jiajun Wu,et al.  MarrNet: 3D Shape Reconstruction via 2.5D Sketches , 2017, NIPS.

[9]  Jiajun Wu,et al.  Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling , 2016, NIPS.

[10]  Jianxiong Xiao,et al.  3D ShapeNets: A deep representation for volumetric shapes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[12]  Alexei A. Efros,et al.  Multi-view Supervision for Single-View Reconstruction via Differentiable Ray Consistency , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[14]  Wei Liu,et al.  Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images , 2018, ECCV.

[15]  A. U.S.,et al.  Recovering Surface Shape and Orientation from Texture , 2002 .

[16]  Silvio Savarese,et al.  3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction , 2016, ECCV.

[17]  R. Venkatesh Babu,et al.  3D-LMNet: Latent Embedding Matching for Accurate and Diverse 3D Point Cloud Reconstruction from a Single Image , 2018, BMVC.

[18]  Razvan Pascanu,et al.  On the difficulty of training recurrent neural networks , 2012, ICML.

[19]  Leonidas J. Guibas,et al.  Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[20]  Thomas Brox,et al.  Octree Generating Networks: Efficient Convolutional Architectures for High-resolution 3D Outputs , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[21]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[22]  José Ruíz Ascencio,et al.  Visual simultaneous localization and mapping: a survey , 2012, Artificial Intelligence Review.

[23]  Markus H. Gross,et al.  Human Shape from Silhouettes Using Generative HKS Descriptors and Cross-Modal Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Ronen Basri,et al.  A Survey on Structure from Motion , 2017, ArXiv.

[25]  Samy Bengio,et al.  Order Matters: Sequence to sequence for sets , 2015, ICLR.

[26]  Jitendra Malik,et al.  Learning a Multi-View Stereo Machine , 2017, NIPS.

[27]  Silvio Savarese,et al.  Beyond PASCAL: A benchmark for 3D object detection in the wild , 2014, IEEE Winter Conference on Applications of Computer Vision.

[28]  Richard Szeliski,et al.  A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[29]  Jiajun Wu,et al.  Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Shubham Tulsiani,et al.  Learning Single-view 3D Reconstruction of Objects and Scenes , 2018 .

[31]  Yang Liu,et al.  O-CNN , 2017, ACM Trans. Graph..

[32]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[33]  Wonyong Sung,et al.  Single stream parallelization of generalized LSTM-like RNNs on a GPU , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[34]  Meng Wang,et al.  3DensiNet: A Robust Neural Network Architecture towards 3D Volumetric Object Prediction from 2D Image , 2017, ACM Multimedia.

[35]  Stefan Roth,et al.  Discriminative shape from shading in uncalibrated illumination , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Bo Yang,et al.  Dense 3D Object Reconstruction from a Single Depth View , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Pietro Perona,et al.  3D Reconstruction by Shadow Carving: Theory and Practical Evaluation , 2007, International Journal of Computer Vision.

[38]  Ozan ARSLAN 3d Object Reconstruction from a Single Image. , 2014 .