Semantic structure from motion

Conventional rigid structure from motion (SFM) addresses the problem of recovering the camera parameters (motion) and the 3D locations (structure) of scene points, given observed 2D image feature points. In this paper, we propose a new formulation called Semantic Structure From Motion (SSFM). In addition to the geometrical constraints provided by SFM, SSFM takes advantage of both semantic and geometrical properties associated with objects in the scene (Fig. 1). These properties allow us to recover not only the structure and motion but also the 3D locations, poses, and categories of objects in the scene. We cast this problem as a max-likelihood problem where geometry (cameras, points, objects) and semantic information (object classes) are simultaneously estimated. The key intuition is that, in addition to image features, the measurements of objects across views provide additional geometrical constraints that relate cameras and scene parameters. These constraints make the geometry estimation process more robust and, in turn, make object detection more accurate. Our framework has the unique ability to: i) estimate camera poses only from object detections, ii) enhance camera pose estimation, compared to feature-point-based SFM algorithms, iii) improve object detections given multiple un-calibrated images, compared to independently detecting objects in single images. Extensive quantitative results on three datasets–LiDAR cars, street-view pedestrians, and Kinect office desktop–verify our theoretical claims.

[1]  Roberto Cipolla,et al.  Segmentation and Recognition Using Structure from Motion Point Clouds , 2008, ECCV.

[2]  Mubarak Shah,et al.  A Multiview Approach to Tracking People in Crowded Scenes Using a Planar Homography Constraint , 2006, ECCV.

[3]  Silvio Savarese,et al.  Application of D4AR - A 4-Dimensional augmented reality model for automating construction progress monitoring data collection, processing and communication , 2009, J. Inf. Technol. Constr..

[4]  B. Schiele,et al.  Combined Object Categorization and Segmentation With an Implicit Shape Model , 2004 .

[5]  Pietro Perona,et al.  Object class recognition by unsupervised scale-invariant learning , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[6]  Luc Van Gool,et al.  From images to 3D models , 2002, CACM.

[7]  Stephen Gould,et al.  Decomposing a scene into geometric and semantically consistent regions , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[8]  Andrew W. Fitzgibbon,et al.  Bundle Adjustment - A Modern Synthesis , 1999, Workshop on Vision Algorithms.

[9]  Richard Szeliski,et al.  Modeling the World from Internet Photo Collections , 2008, International Journal of Computer Vision.

[10]  Bernhard P. Wrobel,et al.  Multiple View Geometry in Computer Vision , 2001 .

[11]  Li Fei-Fei,et al.  Towards total scene understanding: Classification, annotation and segmentation in an automatic framework , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Antonio Torralba,et al.  Depth from Familiar Objects: A Hierarchical Model for 3D Scenes , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[14]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[15]  Daniel F. Huber,et al.  Automatic 3D modeling using range images obtained from unknown viewpoints , 2001, Proceedings Third International Conference on 3-D Digital Imaging and Modeling.

[16]  Silvio Savarese,et al.  Toward coherent object detection and scene layout understanding , 2011, Image Vis. Comput..

[17]  Ryan M. Eustice,et al.  Ford Campus vision and lidar data set , 2011, Int. J. Robotics Res..

[18]  Alexei A. Efros,et al.  Putting Objects in Perspective , 2006, CVPR.

[19]  Jitendra Malik,et al.  Recognizing Objects in Range Data Using Regional Point Descriptors , 2004, ECCV.

[20]  Roberto Cipolla,et al.  Modelling and Interpretation of Architecture from Several Images , 2004, International Journal of Computer Vision.

[21]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[22]  Luc Van Gool,et al.  3D Urban Scene Modeling Integrating Recognition and Reconstruction , 2008, International Journal of Computer Vision.

[23]  Pietro Perona,et al.  Reducing "Structure From Motion": A General Framework for Dynamic Vision Part 1: Modeling , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[24]  Silvio Savarese,et al.  3D generic object categorization, localization and pose estimation , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[25]  Roberto Cipolla,et al.  Structure from motion , 2008 .

[26]  Frank Dellaert,et al.  Feature Correspondence: A Markov Chain Monte Carlo Approach , 2000, NIPS.

[27]  Luc Van Gool,et al.  Wide Baseline Stereo Matching based on Local, Affinely Invariant Regions , 2000, BMVC.

[28]  Yizong Cheng,et al.  Mean Shift, Mode Seeking, and Clustering , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[29]  Peter Green,et al.  Markov chain Monte Carlo in Practice , 1996 .

[30]  Nico Blodow,et al.  Towards 3D Point cloud based object maps for household environments , 2008, Robotics Auton. Syst..

[31]  Tim Weyrich,et al.  Capturing Time-of-Flight data with confidence , 2011, CVPR 2011.

[32]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[33]  David Nistér,et al.  An efficient solution to the five-point relative pose problem , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..