Integrating Function, Geometry, Appearance for Scene Parsing

Yibiao Zhao, University of California, Los Angeles (UCLA), USA. E-mail: ybzhao@ucla.edu. www.yibiaozhao.com
Song-Chun Zhu, University of California, Los Angeles (UCLA), USA. E-mail: sczhu@stat.ucla.edu. http://www.stat.ucla.edu/~sczhu

In this paper, we present a Stochastic Scene Grammar (SSG) for parsing 2D indoor images into 3D scene layouts. Our grammar model integrates object functionality, 3D object geometry, and 2D image appearance in a Function-Geometry-Appearance (FGA) hierarchy. In contrast to the prevailing approach in the literature, which recognizes scenes and detects objects through appearance-based classification using machine learning techniques, our method recognizes objects and scenes by reasoning about their functionality. Functionality is an essential property that often defines the categories of objects and scenes and determines the design of object geometry and scene layout. For example, a sofa is for people to sit on comfortably, and a kitchen is a space for people to prepare food using various objects. Our SSG formulates object functionality and the contextual relations between objects and imagined human poses as a joint probability distribution over the FGA hierarchy. The hierarchy includes both functional concepts (the scene category, functional groups, functional objects, functional parts) and geometric entities (3D/2D/1D shape primitives). The decomposition of the grammar terminates at the lines and regions detected bottom-up. We use a Markov chain Monte Carlo (MCMC) algorithm to maximize the Bayesian posterior probability; the output parse tree provides a 3D description of the 2D image in the FGA hierarchy. Experimental results on two challenging indoor datasets demonstrate that the proposed approach not only significantly widens the scope of indoor scene parsing from traditional scene segmentation, labeling, and 3D reconstruction to functional object recognition, but also yields improved overall performance.
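The idea of searching for a maximum a posteriori interpretation with MCMC can be illustrated with a deliberately small sketch. This is not the paper's grammar or inference algorithm: the three labels, the likelihood table, the context prior, and the function names (`log_prior`, `log_posterior`, `mcmc_map`, `exhaustive_map`) are all hypothetical, chosen only to show a Metropolis-Hastings chain that combines an appearance/geometry likelihood with a functional-context prior and keeps the best state it visits.

```python
import math
import random
from itertools import product

# Hypothetical functional labels for three detected 3D cuboids.
LABELS = ("bed", "table", "chair")

# Made-up likelihoods P(evidence | label) for each cuboid (illustrative only).
LIKELIHOOD = [
    {"bed": 0.7, "table": 0.2, "chair": 0.1},
    {"bed": 0.1, "table": 0.6, "chair": 0.3},
    {"bed": 0.1, "table": 0.4, "chair": 0.5},
]


def log_prior(state):
    """Toy context prior: reward a table+chair group, penalize extra beds."""
    score = 0.0
    if "table" in state and "chair" in state:
        score += 1.0
    score -= 2.0 * max(0, state.count("bed") - 1)
    return score


def log_posterior(state):
    """Unnormalized log posterior = log prior + sum of log likelihoods."""
    log_lik = sum(math.log(LIKELIHOOD[i][lab]) for i, lab in enumerate(state))
    return log_prior(state) + log_lik


def mcmc_map(n_steps=2000, seed=0):
    """Metropolis-Hastings with a symmetric relabel proposal; returns the
    highest-scoring labeling visited by the chain."""
    rng = random.Random(seed)
    state = tuple("bed" for _ in LIKELIHOOD)
    current = log_posterior(state)
    best_state, best_score = state, current
    for _ in range(n_steps):
        # Propose: pick one cuboid and redraw its label uniformly.
        i = rng.randrange(len(state))
        proposal = state[:i] + (rng.choice(LABELS),) + state[i + 1:]
        candidate = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            state, current = proposal, candidate
            if current > best_score:
                best_state, best_score = state, current
    return best_state


def exhaustive_map():
    """Brute-force MAP over all labelings, feasible only for toy problems."""
    return max(product(LABELS, repeat=len(LIKELIHOOD)), key=log_posterior)


if __name__ == "__main__":
    print("MCMC estimate: ", mcmc_map())
    print("Exhaustive MAP:", exhaustive_map())
```

In this toy setting the state space has only 27 labelings, so exhaustive search is available as a check. A parse-tree space like the one described above is far too large to enumerate, which is the usual motivation for stochastic search.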
