Contextually Guided Semantic Labeling and Search for 3D Point Clouds

RGB-D cameras, which provide an RGB image together with depth data, are becoming increasingly popular for robotic perception. In this paper, we address the task of detecting commonly found objects in the 3D point clouds of indoor scenes obtained from such cameras. Our method uses a graphical model that captures various features and contextual relations, including local visual appearance and shape cues, object co-occurrence relationships, and geometric relationships. With a large number of object classes and relations, the model's parsimony becomes important, and we address this by using multiple types of edge potentials. We train the model using a maximum-margin learning approach. In our experiments over a total of 52 3D scenes of homes and offices (composed from about 550 views), we achieve a performance of 84.06% and 73.38% in labeling office and home scenes, respectively, for 17 object classes each. We also present a method for a robot to search for an object using the learned model and the contextual information available from the current labeling of the scene. We applied this algorithm successfully on a mobile robot for the task of finding 12 object classes in 10 different offices and achieved a precision of 97.56% with 78.43% recall.
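As a rough illustration of how a graphical model of this kind can score a candidate labeling, the sketch below sums a node potential per segment (appearance and shape features) and an edge potential per pair of neighboring segments (contextual features). It is a minimal sketch under stated assumptions, not the paper's implementation: the names `labeling_score` and `best_labeling`, the array shapes, and the single dense edge-weight tensor are all hypothetical, and the brute-force search stands in for whatever inference procedure the model actually uses. The multiple edge-potential types mentioned in the abstract are not modeled here.

```python
from itertools import product

import numpy as np


def labeling_score(labels, node_feats, edges, edge_feats, w_node, w_edge):
    """Score a labeling as the sum of node and pairwise edge potentials.

    labels:     sequence of n class indices, one per segment
    node_feats: array (n, d_n), per-segment features
    edges:      list of (i, j) index pairs of neighboring segments
    edge_feats: array (len(edges), d_e), per-edge features
    w_node:     array (K, d_n), one weight vector per class (hypothetical layout)
    w_edge:     array (K, K, d_e), one weight vector per class pair (hypothetical layout)
    """
    score = 0.0
    # Node potentials: how well each segment's features match its label.
    for i, k in enumerate(labels):
        score += float(w_node[k] @ node_feats[i])
    # Edge potentials: how compatible the labels of neighboring segments are.
    for e, (i, j) in enumerate(edges):
        score += float(w_edge[labels[i], labels[j]] @ edge_feats[e])
    return score


def best_labeling(node_feats, edges, edge_feats, w_node, w_edge):
    """Exhaustively search all labelings (only feasible for tiny examples)."""
    n, num_classes = node_feats.shape[0], w_node.shape[0]
    return max(
        product(range(num_classes), repeat=n),
        key=lambda y: labeling_score(y, node_feats, edges, edge_feats, w_node, w_edge),
    )


if __name__ == "__main__":
    # Toy scene: two segments, two classes, one edge between the segments.
    node_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
    edges = [(0, 1)]
    edge_feats = np.array([[1.0]])
    w_node = np.array([[1.0, 0.0], [0.0, 2.0]])
    w_edge = np.zeros((2, 2, 1))
    w_edge[0, 1, 0] = 0.5  # class 0 next to class 1 is rewarded

    y = best_labeling(node_feats, edges, edge_feats, w_node, w_edge)
    print(y, labeling_score(y, node_feats, edges, edge_feats, w_node, w_edge))
```

In a maximum-margin setting, the weights would be learned so that the ground-truth labeling of each training scene scores higher than competing labelings by a margin; the weights in the toy example are set by hand purely for illustration.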
