Contextually guided semantic labeling and search for three-dimensional point clouds

RGB-D cameras, which give an RGB image together with depths, are becoming increasingly popular for robotic perception. In this paper, we address the task of detecting commonly found objects in the three-dimensional (3D) point cloud of indoor scenes obtained from such cameras. Our method uses a graphical model that captures various features and contextual relations, including the local visual appearance and shape cues, object co-occurrence relationships and geometric relationships. With a large number of object classes and relations, the model’s parsimony becomes important and we address that by using multiple types of edge potentials. We train the model using a maximum-margin learning approach. In our experiments concerning a total of 52 3D scenes of homes and offices (composed from about 550 views), we get a performance of 84.06% and 73.38% in labeling office and home scenes respectively for 17 object classes each. We also present a method for a robot to search for an object using the learned model and the contextual information available from the current labelings of the scene. We applied this algorithm successfully on a mobile robot for the task of finding 12 object classes in 10 different offices and achieved a precision of 97.56% with 78.43% recall.1

[1]  Ashutosh Saxena,et al.  Co-evolutionary predictors for kinematic pose inference from RGBD images , 2012, GECCO '12.

[2]  Richard Szeliski,et al.  A Comparative Study of Energy Minimization Methods for Markov Random Fields with Smoothness-Based Priors , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Vladimir Kolmogorov,et al.  Optimizing Binary MRFs via Extended Roof Duality , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  David A. McAllester,et al.  A discriminatively trained, multiscale, deformable part model , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Ben Taskar,et al.  Learning associative Markov networks , 2004, ICML.

[6]  Thorsten Joachims,et al.  Labeling 3D scenes for Personal Assistant Robots , 2011, ArXiv.

[7]  Thorsten Joachims,et al.  Semantic Labeling of 3D Point Clouds for Indoor Scenes , 2011, NIPS.

[8]  Vladimir G. Kim,et al.  Shape-based recognition of 3D point clouds in urban environments , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[9]  Roman Shapovalov,et al.  Cutting-Plane Training of Non-associative Markov Network for 3D Point Cloud Segmentation , 2011, 2011 International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission.

[10]  Andrew Y. Ng,et al.  Integrating Visual and Range Data for Robotic Object Detection , 2008, ECCV 2008.

[11]  Tal Arbel,et al.  Efficient Discriminant Viewpoint Selection for Active Bayesian Recognition , 2006, International Journal of Computer Vision.

[12]  Tal Arbel,et al.  A fast discriminant approach to active object recognition and pose estimation , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[13]  Siddhartha S. Srinivasa,et al.  Structure discovery in multi-modal data: A region-based approach , 2011, 2011 IEEE International Conference on Robotics and Automation.

[14]  Tsuhan Chen,et al.  Toward Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Ashutosh Saxena,et al.  3-D Depth Reconstruction from a Single Still Image , 2007, International Journal of Computer Vision.

[16]  Ben Taskar,et al.  Max-Margin Markov Networks , 2003, NIPS.

[17]  Nico Blodow,et al.  Towards 3D Point cloud based object maps for household environments , 2008, Robotics Auton. Syst..

[18]  Ashutosh Saxena,et al.  Learning the right model: Efficient max-margin learning in Laplacian CRFs , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Antonio Torralba,et al.  Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes , 2003, NIPS.

[20]  Pittsburgh,et al.  The MOPED framework: Object recognition and pose estimation for manipulation , 2011 .

[21]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[22]  Ales Leonardis,et al.  A framework for visual-context-aware object detection in still images , 2010, Comput. Vis. Image Underst..

[23]  Joachim Denzler,et al.  Information Theoretic Sensor Data Selection for Active Object Recognition and State Estimation , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[24]  Quoc V. Le,et al.  High-accuracy 3D sensing for mobile manipulation: Improving object detection and door opening , 2009, 2009 IEEE International Conference on Robotics and Automation.

[25]  Alexei A. Efros,et al.  Putting Objects in Perspective , 2006, CVPR.

[26]  Pierre Hansen,et al.  Roof duality, complementation and persistency in quadratic 0–1 optimization , 1984, Math. Program..

[27]  Martial Hebert,et al.  Classifier fusion for outdoor obstacle detection , 2004, IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04. 2004.

[28]  Tsuhan Chen,et al.  $\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding , 2011, NIPS.

[29]  Ashutosh Saxena,et al.  Cascaded Classification Models: Combining Models for Holistic Scene Understanding , 2008, NIPS.

[30]  Thomas Hofmann,et al.  Support vector machine learning for interdependent and structured output spaces , 2004, ICML.

[31]  Thorsten Joachims,et al.  Cutting-plane training of structural SVMs , 2009, Machine Learning.

[32]  Dieter Fox,et al.  A large-scale hierarchical multi-view RGB-D object dataset , 2011, 2011 IEEE International Conference on Robotics and Automation.

[33]  Luc Van Gool,et al.  Dynamic 3D Scene Analysis from a Moving Vehicle , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  D. Fox,et al.  Classification and Semantic Mapping of Urban Environments , 2011, Int. J. Robotics Res..

[35]  James J. Little,et al.  Viewpoint detection models for sequential embodied object category recognition , 2010, 2010 IEEE International Conference on Robotics and Automation.

[36]  David A. Forsyth,et al.  Thinking Inside the Box: Using Appearance Models and Context Based on Room Geometry , 2010, ECCV.

[37]  Takeo Kanade,et al.  Estimating Spatial Layout of Rooms using Volumetric Reasoning about Objects and Surfaces , 2010, NIPS.

[38]  Jianxiong Xiao,et al.  Multiple view semantic segmentation for street view images , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[39]  Martial Hebert,et al.  Natural terrain classification using three‐dimensional ladar data for ground robot mobility , 2006, J. Field Robotics.

[40]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[41]  Joel W. Burdick,et al.  A probabilistic framework for object search with 6-DOF pose estimation , 2011, Int. J. Robotics Res..

[42]  Martial Hebert,et al.  3-D scene analysis via sequenced predictions over points and regions , 2011, 2011 IEEE International Conference on Robotics and Automation.

[43]  Ashutosh Saxena,et al.  Learning Depth from Single Monocular Images , 2005, NIPS.

[44]  Antonio Torralba,et al.  Contextual Priming for Object Detection , 2003, International Journal of Computer Vision.

[45]  O. Barinova,et al.  NON-ASSOCIATIVE MARKOV NETWORKS FOR 3D POINT CLOUD CLASSIFICATION , 2010 .

[46]  Endre Boros,et al.  Pseudo-Boolean optimization , 2002, Discret. Appl. Math..

[47]  Martial Hebert,et al.  Onboard contextual classification of 3-D point clouds with learned high-order Markov Random Fields , 2009, 2009 IEEE International Conference on Robotics and Automation.

[48]  Yun Jiang,et al.  Learning to place new objects in a scene , 2012, Int. J. Robotics Res..

[49]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[50]  Daniel Huber,et al.  Using Context to Create Semantic 3D Models of Indoor Environments , 2010, BMVC.

[51]  Tsuhan Chen,et al.  Robotic Object Detection: Learning to Improve the Classifiers Using Sparse Graphs for Path Planning , 2011, IJCAI.

[52]  Bart Selman,et al.  Unstructured human activity detection from RGBD images , 2011, 2012 IEEE International Conference on Robotics and Automation.

[53]  Thorsten Joachims,et al.  Training structural SVMs when exact inference is intractable , 2008, ICML '08.

[54]  Ben Taskar,et al.  Discriminative learning of Markov random fields for segmentation of 3D scan data , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[55]  Dieter Fox,et al.  Sparse distance learning for object recognition combining RGB and depth information , 2011, 2011 IEEE International Conference on Robotics and Automation.

[56]  Alexei A. Efros,et al.  An empirical study of context in object detection , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[57]  Daphne Koller,et al.  Learning Spatial Context: Using Stuff to Find Things , 2008, ECCV.