Basic level scene understanding: from labels to structure and beyond

An early goal of computer vision was to build a system that could automatically understand a 3D scene just by looking. This requires not only the ability to extract 3D information from images alone, but also the ability to handle the large variety of environments that make up our visual world. This paper summarizes our recent efforts toward these goals. First, we describe the SUN database, a collection of annotated images spanning 908 different scene categories. This database allows us to systematically study the space of possible everyday scenes and to establish a benchmark for scene and object recognition. We also explore ways of coping with the variety of viewpoints within these scenes. To this end, we have introduced a database of 360° panoramic images for many of the scene categories in the SUN database and have explored viewpoint recognition within these environments. Finally, we describe steps toward a unified 3D parsing of everyday scenes: (i) the ability to localize geometric primitives in images, such as cuboids and cylinders, which make up many everyday objects, and (ii) an integrated system to extract the 3D structure of the scene and the objects depicted in an image.
