Learning 3D Object Categories by Looking Around Them

Traditional approaches for learning 3D object categories use either synthetic data or manual supervision. In this paper, we propose a method which does not require manual annotations and is instead cued by observing objects from a moving vantage point. Our system builds on two innovations: a Siamese viewpoint factorization network that robustly aligns different videos together without explicitly comparing 3D shapes; and a 3D shape completion network that can extract the full shape of an object from partial observations. We also demonstrate the benefits of configuring networks to perform probabilistic predictions as well as of geometry-aware data augmentation schemes. We obtain state-of-the-art results on publicly-available benchmarks.

[1]  Silvio Savarese,et al.  3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction , 2016, ECCV.

[2]  Mathieu Aubry,et al.  Deep Exemplar 2D-3D Detection by Adapting from Real to Rendered Views , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Jitendra Malik,et al.  Aligning 3D models to RGB-D images of cluttered scenes , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Silvio Savarese,et al.  A coarse-to-fine model for 3D pose estimation and sub-category recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Silvio Savarese,et al.  Beyond PASCAL: A benchmark for 3D object detection in the wild , 2014, IEEE Winter Conference on Applications of Computer Vision.

[6]  Ankush Gupta,et al.  Synthetic Data for Text Localisation in Natural Images , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Roberto Cipolla,et al.  Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding , 2015, BMVC.

[8]  Silvio Savarese,et al.  3D generic object categorization, localization and pose estimation , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[9]  Thomas Vetter,et al.  Face Recognition Based on Fitting a 3D Morphable Model , 2003, IEEE Trans. Pattern Anal. Mach. Intell..

[10]  Lawrence G. Roberts,et al.  Machine Perception of Three-Dimensional Solids , 1963, Outstanding Dissertations in the Computer Sciences.

[11]  Nassir Navab,et al.  Deeper Depth Prediction with Fully Convolutional Residual Networks , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[12]  P. Fua,et al.  Pose estimation for category specific multiview object localization , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Li Zhang,et al.  Model evolution: An incremental approach to non-rigid structure from motion , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[14]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[15]  Silvio Savarese,et al.  A multi-view probabilistic model for 3D object classes , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  S. Umeyama,et al.  Least-Squares Estimation of Transformation Parameters Between Two Point Patterns , 1991, IEEE Trans. Pattern Anal. Mach. Intell..

[17]  Jitendra Malik,et al.  Hypercolumns for object segmentation and fine-grained localization , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Thomas Brox,et al.  DeMoN: Depth and Motion Network for Learning Monocular Stereo , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Derek Hoiem,et al.  Completing 3D object shape from one depth image , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Feng Liu,et al.  Joint Face Alignment and 3D Face Reconstruction , 2016, ECCV.

[21]  Antonio Torralba,et al.  Parsing IKEA Objects: Fine Pose Estimation , 2013, 2013 IEEE International Conference on Computer Vision.

[22]  Yannick Morvan,et al.  Acquisition, compression and rendering of depth and texture for multi-view video , 2009 .

[23]  Abhinav Gupta,et al.  Learning a Predictable and Generative Vector Representation for Objects , 2016, ECCV.

[24]  Jiajun Wu,et al.  Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling , 2016, NIPS.

[25]  Jan-Michael Frahm,et al.  Pixelwise View Selection for Unstructured Multi-View Stereo , 2016, ECCV.

[26]  Zoubin Ghahramani,et al.  Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference , 2015, ArXiv.

[27]  Thomas Brox,et al.  Unsupervised Generation of a View Point Annotated Car Dataset from Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[28]  Lourdes Agapito,et al.  Lifting Object Detection Datasets into 3D , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  David G. Lowe,et al.  Three-Dimensional Object Recognition from Single Two-Dimensional Images , 1987, Artif. Intell..

[30]  Thomas Brox,et al.  Multi-view 3D Models from Single Images with a Convolutional Network , 2015, ECCV.

[31]  Andrew W. Fitzgibbon,et al.  KinectFusion: Real-time dense surface mapping and tracking , 2011, 2011 10th IEEE International Symposium on Mixed and Augmented Reality.

[32]  Radu Bogdan Rusu,et al.  3D is here: Point Cloud Library (PCL) , 2011, 2011 IEEE International Conference on Robotics and Automation.

[33]  Jan-Michael Frahm,et al.  Structure-from-Motion Revisited , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Jitendra Malik,et al.  Viewpoints and keypoints , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Leonidas J. Guibas,et al.  Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[36]  Leonidas J. Guibas,et al.  ObjectNet3D: A Large Scale Database for 3D Object Recognition , 2016, ECCV.

[37]  Abhinav Gupta,et al.  Marr Revisited: 2D-3D Alignment via Surface Normal Prediction , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Leonidas J. Guibas,et al.  PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Leonidas J. Guibas,et al.  ShapeNet: An Information-Rich 3D Model Repository , 2015, ArXiv.

[40]  Peter V. Gehler,et al.  Teaching 3D geometry to deformable part models , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[41]  Vladlen Koltun,et al.  A Large Dataset of Object Scans , 2016, ArXiv.

[42]  Bernt Schiele,et al.  Detailed 3D Representations for Object Recognition and Modeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43]  Peter V. Gehler,et al.  Multi-View Priors for Learning Detectors from Sparse Viewpoint Data , 2014, ICLR.

[44]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Ronen Basri,et al.  Viewpoint-aware object detection and pose estimation , 2011, 2011 International Conference on Computer Vision.

[46]  Andrew W. Fitzgibbon,et al.  Finding nemo: Deformable object class modelling using curve matching , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[47]  Alexei A. Efros,et al.  Seeing 3D Chairs: Exemplar Part-Based 2D-3D Alignment Using a Large Dataset of CAD Models , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[48]  Jitendra Malik,et al.  Category-specific object reconstruction from a single image , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Vladlen Koltun,et al.  Single-view reconstruction via joint analysis of image and shape collections , 2015, ACM Trans. Graph..