Scene Understanding by Reasoning Stability and Safety

This paper presents a new perspective on 3D scene understanding: reasoning about object stability and safety using intuitive mechanics. Our approach rests on a simple observation that, by human design, objects in static scenes should be stable in the gravity field and safe with respect to physical disturbances such as human activities. This assumption applies to all scene categories and imposes useful constraints on the plausible interpretations (parses) in scene understanding. Given a 3D point cloud of a static scene captured by depth cameras, our method consists of three steps: (i) recovering solid 3D volumetric primitives from voxels; (ii) reasoning about stability by grouping unstable primitives into physically stable objects, optimizing both stability and the scene prior; and (iii) reasoning about safety by evaluating the physical risk to objects under disturbances such as human activity, wind, or earthquakes. We adopt a novel intuitive physics model and represent the energy landscape of each primitive and object in the scene by a disconnectivity graph (DG). We construct a contact graph whose nodes are 3D volumetric primitives and whose edges represent supporting relations. We then use the Swendsen-Wang Cuts algorithm to partition the contact graph into groups, each of which is a stable object. To detect unsafe objects in a static scene, our method further infers hidden and situated causes (disturbances) in the scene, and then applies intuitive physical mechanics to predict possible effects (e.g., falls) as consequences of those disturbances. In experiments, we demonstrate that the algorithm achieves substantially better performance than other state-of-the-art methods in (i) object segmentation, (ii) 3D volumetric recovery, and (iii) scene understanding. We also compare the safety predictions of the intuitive mechanics model with human judgement.
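To make the idea of stability as an energy barrier concrete, the following minimal sketch treats each object as a uniform rigid box resting on a flat floor and computes the energy needed to tip it over one bottom edge. This is an illustrative simplification and not the model used in the paper: the box approximation, the function names `tipping_energy_barrier` and `unsafe_objects`, and the single scalar `disturbance_energy` are all assumptions introduced here.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def tipping_energy_barrier(mass, width, height):
    """Energy in joules needed to tip a uniform box over one bottom edge.

    Assumption (not from the paper): the box rests on its base and pivots
    about an edge, so its center of mass rises from height / 2 to half the
    diagonal of the width-by-height cross-section.
    """
    rest_com = height / 2.0
    peak_com = math.hypot(width, height) / 2.0
    return mass * G * (peak_com - rest_com)


def unsafe_objects(objects, disturbance_energy):
    """Return names of objects whose barrier is below the disturbance energy.

    `objects` maps a name to a (mass, width, height) tuple. Comparing against
    one scalar energy is a hypothetical stand-in for a disturbance model.
    """
    return [
        name
        for name, (mass, width, height) in objects.items()
        if tipping_energy_barrier(mass, width, height) < disturbance_energy
    ]


if __name__ == "__main__":
    scene = {
        "tall_vase": (1.0, 0.2, 1.0),   # narrow base, high center of mass
        "flat_crate": (1.0, 1.0, 0.2),  # wide base, low center of mass
    }
    for name, dims in scene.items():
        print(name, round(tipping_energy_barrier(*dims), 3), "J")
    print("unsafe under a 1 J disturbance:", unsafe_objects(scene, 1.0))
```

Under these assumptions a tall, narrow object has a much lower barrier than a flat, wide one of the same mass, which matches the intuition that it is easier to knock over. A fuller treatment would replace the single barrier with the energy landscape over all resting states of the object.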
