Probabilistic Structure from Motion with Objects (PSfMO)

This paper proposes a probabilistic approach to recover affine camera calibration and object positions and occupancy from multi-view images, using only the information provided by image detections. We show that object localisation and volumetric occupancy can be recovered remarkably well by combining geometric constraints with prior information derived from the object CAD models of the ShapeNet dataset. This is achieved by recasting the problem in a probabilistic framework based on Probabilistic Principal Component Analysis (PPCA). The framework enforces both the geometric constraints and the semantics associated with the object category returned by the object detector. We present results on synthetic data and an extensive real-world evaluation on more than 1200 image sequences from the ScanNet dataset, showing the validity of our approach in realistic scenarios. In particular, we show that 3D statistical priors are key to obtaining reliable reconstructions, especially when the input detections are noisy, as is likely in real scenes.
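The abstract does not spell out the PPCA model itself. As a generic illustration only, the sketch below shows the standard closed-form maximum-likelihood fit of probabilistic PCA (Tipping and Bishop, 1999). It also shows how the resulting Gaussian N(mu, W W^T + sigma2 I) can score a candidate shape vector as a prior. This is not the authors' implementation. The function names `fit_ppca` and `ppca_log_prior` are hypothetical. The synthetic "3D extent" samples are an assumed stand-in for per-category statistics that would, in the paper's setting, come from ShapeNet CAD models.

```python
import numpy as np


def fit_ppca(X, q):
    """Closed-form maximum-likelihood PPCA (Tipping & Bishop, 1999).

    X: (N, D) array of samples; q: latent dimension with 0 < q < D.
    Returns the mean mu (D,), loading matrix W (D, q) and noise variance sigma2.
    """
    N, D = X.shape
    if not 0 < q < D:
        raise ValueError("latent dimension q must satisfy 0 < q < D")
    mu = X.mean(axis=0)
    Xc = X - mu
    S = Xc.T @ Xc / N                           # ML sample covariance (divides by N)
    evals, evecs = np.linalg.eigh(S)            # eigh returns ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]  # reorder to descending
    sigma2 = evals[q:].mean()                   # mean of the D - q discarded eigenvalues
    # W = U_q (Lambda_q - sigma2 I)^(1/2), choosing the arbitrary rotation R = I;
    # broadcasting scales column j of U_q by sqrt(lambda_j - sigma2).
    W = evecs[:, :q] * np.sqrt(np.maximum(evals[:q] - sigma2, 0.0))
    return mu, W, sigma2


def ppca_log_prior(x, mu, W, sigma2):
    """Log-density of x under the PPCA marginal N(mu, W W^T + sigma2 I)."""
    D = mu.shape[0]
    C = W @ W.T + sigma2 * np.eye(D)
    diff = x - mu
    _, logdet = np.linalg.slogdet(C)
    maha = diff @ np.linalg.solve(C, diff)      # squared Mahalanobis distance
    return -0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)


# Hypothetical stand-in for one object category: 500 samples of 3D extents
# (width, depth, height) that vary mostly along a single direction plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
X = (np.array([0.5, 0.5, 0.8])
     + latent @ np.array([[0.10, 0.08, 0.15]])
     + 0.01 * rng.normal(size=(500, 3)))
mu, W, sigma2 = fit_ppca(X, q=1)

# A shape at the category mean is more probable than one far from it.
print(ppca_log_prior(mu, mu, W, sigma2) > ppca_log_prior(mu + 1.0, mu, W, sigma2))  # True
```

The closed form keeps the q leading directions of the sample covariance and pools the remaining variance into an isotropic noise term. The model therefore preserves the total variance (trace) of the data while staying cheap to evaluate. That property makes a PPCA density a convenient regulariser to add to a geometric cost, which is the general role the abstract assigns to the 3D statistical priors.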
