SCSCN: A Separated Channel-Spatial Convolution Net With Attention for Single-View Reconstruction

Three-dimensional (3-D) object reconstruction is a challenging problem in computer vision, especially single-view reconstruction. In this article, we propose a new 3-D reconstruction network, termed the separated channel-spatial convolution net with attention (SCSCN), which reconstructs the 3-D shape of an object given a two-dimensional (2-D) image from any viewpoint. Our method is a simple encoder–decoder structure, where the encoder uses separated channel-spatial convolution and separated channel-spatial attention to extract features from the input image, and the decoder recovers 3-D shapes from these features. The separated channel-spatial convolution obtains channel information and spatial information separately, through a channel path and a spatial path. At the same time, in order to select a more reasonable combination of features according to their degree of contribution to the reconstruction task, channel attention and spatial attention are inserted into the respective paths. As a result, the encoder can extract a strong representation of the object. Quantitative experiments show that our SCSCN depends only weakly on 3-D supervision and achieves high-quality reconstruction under 2-D supervision alone, which demonstrates the effectiveness of the encoder. In addition, we conduct a qualitative visualization experiment to confirm the rationality of the attention blocks in the feature extraction process.
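
To make the two-path idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The abstract does not specify the layers, so several choices here are assumptions: the channel path is modeled as a 1x1 (pointwise) convolution, the spatial path as a 3x3 depthwise convolution with zero padding, both attention blocks as parameter-free sigmoid gates, and the two paths are fused by element-wise sum. All function names (`channel_path`, `spatial_path`, `channel_attention`, `spatial_attention`, `scsc_block`) are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def channel_path(x, w):
    """Assumed channel path: 1x1 convolution mixing channels at each pixel.
    x is [C][H][W]; w is [K][C]."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[sum(w[k][c] * x[c][i][j] for c in range(C))
              for j in range(W)] for i in range(H)] for k in range(len(w))]

def spatial_path(x, kernels):
    """Assumed spatial path: depthwise 3x3 convolution with zero padding,
    so each channel is filtered spatially on its own. kernels is [C][3][3]."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out = []
    for c in range(C):
        plane = [[0.0] * W for _ in range(H)]
        for i in range(H):
            for j in range(W):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < H and 0 <= jj < W:
                            acc += kernels[c][di + 1][dj + 1] * x[c][ii][jj]
                plane[i][j] = acc
        out.append(plane)
    return out

def channel_attention(x):
    """Assumed gate: scale each channel by sigmoid(global average of it)."""
    H, W = len(x[0]), len(x[0][0])
    gates = [sigmoid(sum(map(sum, ch)) / (H * W)) for ch in x]
    return [[[g * v for v in row] for row in ch] for g, ch in zip(gates, x)]

def spatial_attention(x):
    """Assumed gate: scale each pixel by sigmoid(mean across channels)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    gate = [[sigmoid(sum(x[c][i][j] for c in range(C)) / C)
             for j in range(W)] for i in range(H)]
    return [[[gate[i][j] * x[c][i][j] for j in range(W)]
             for i in range(H)] for c in range(C)]

def scsc_block(x, w_pointwise, k_depthwise):
    """Run both paths, apply the matching attention to each, and fuse by
    element-wise sum (the fusion rule is an assumption)."""
    a = channel_attention(channel_path(x, w_pointwise))
    b = spatial_attention(spatial_path(x, k_depthwise))
    return [[[va + vb for va, vb in zip(ra, rb)]
             for ra, rb in zip(ca, cb)] for ca, cb in zip(a, b)]

# Toy input: 2 channels of size 3x3, with identity weights in both paths.
x = [[[float(c + i + j) for j in range(3)] for i in range(3)] for c in range(2)]
w_id = [[1.0, 0.0], [0.0, 1.0]]
k_id = [[[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]] for _ in range(2)]
y = scsc_block(x, w_id, k_id)
print(len(y), len(y[0]), len(y[0][0]))  # prints "2 3 3"
```

Summing the two paths requires the channel path to output as many channels as the input has. A real encoder would learn the convolution weights and the attention gates, and would typically be written with a tensor library rather than nested lists.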
