Robust Attentional Aggregation of Deep Feature Sets for Multi-view 3D Reconstruction

We study the problem of recovering an underlying 3D shape from a set of images. Existing learning-based approaches usually resort to recurrent neural nets, e.g., GRU, or intuitive pooling operations, e.g., max/mean pooling, to fuse multiple deep features encoded from input images. However, GRU-based approaches cannot estimate 3D shapes consistently across different permutations of the same set of input images, because the recurrent unit is permutation variant. They are also unlikely to refine the 3D shape when given more images, owing to the long-term memory loss of GRUs. Commonly used pooling approaches capture only partial information, e.g., max/mean values, and ignore other valuable features. In this paper, we present a new feed-forward neural module, named AttSets, together with a dedicated training algorithm, named FASet, to attentively aggregate an arbitrarily sized deep feature set for multi-view 3D reconstruction. The AttSets module is permutation invariant, computationally efficient and flexible to implement, while the FASet algorithm enables the AttSets-based network to be remarkably robust and to generalize to an arbitrary number of input images. We thoroughly evaluate FASet and the properties of AttSets on multiple large public datasets. Extensive experiments show that AttSets together with the FASet algorithm significantly outperforms existing aggregation approaches.
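To make the idea of permutation-invariant attentional aggregation concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the internals of AttSets, so the scoring layer (a single shared linear map), the softmax over the set dimension, and the names `attentional_aggregate`, `weights`, and `bias` are all assumptions made for illustration. The FASet training algorithm is not covered here.

```python
import math
import random


def attentional_aggregate(features, weights, bias):
    """Fuse N feature vectors (each of length D) into one vector of length D.

    features: list of N lists of D floats (one deep feature per input image)
    weights:  D x D matrix as a list of D rows (hypothetical learned parameters)
    bias:     list of D floats (hypothetical learned parameters)
    """
    n, d = len(features), len(features[0])

    # Step 1: score every element of every feature with a shared linear layer.
    # The same weights are applied to each set member (assumed design).
    scores = [
        [sum(x[k] * weights[k][j] for k in range(d)) + bias[j] for j in range(d)]
        for x in features
    ]

    fused = []
    for j in range(d):
        # Step 2: softmax across the N set members, separately per channel j.
        column = [scores[i][j] for i in range(n)]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in column]
        total = sum(exps)

        # Step 3: attention-weighted sum over the set for channel j.
        fused.append(sum(e / total * features[i][j] for i, e in enumerate(exps)))
    return fused


# Illustrative usage with random stand-ins for learned parameters and features.
random.seed(0)
D = 4
demo_weights = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(D)]
demo_bias = [random.uniform(-1, 1) for _ in range(D)]
demo_features = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(5)]

fused_forward = attentional_aggregate(demo_features, demo_weights, demo_bias)
fused_reversed = attentional_aggregate(demo_features[::-1], demo_weights, demo_bias)
```

Because each set member is scored independently and the results are combined by a sum, reordering the inputs changes the output only by floating-point rounding, and the same function accepts any number of input features. A recurrent unit, by contrast, carries state from one input to the next, so its output depends on input order.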
