Paper information - SUN RGB-D: A RGB-D scene understanding benchmark suite

Although RGB-D sensors have enabled major breakthroughs for several vision tasks, such as 3D reconstruction, we have not attained the same level of success in high-level scene understanding. Perhaps one of the main reasons is the lack of a large-scale benchmark with 3D annotations and 3D evaluation metrics. In this paper, we introduce an RGB-D benchmark suite with the goal of advancing the state of the art in all major scene understanding tasks. Our dataset is captured by four different sensors and contains 10,335 RGB-D images, a scale similar to that of PASCAL VOC. The whole dataset is densely annotated and includes 146,617 2D polygons and 64,595 3D bounding boxes with accurate object orientations, as well as a 3D room layout and scene category for each image. This dataset enables us to train data-hungry algorithms for scene understanding tasks, evaluate them using meaningful 3D metrics, avoid overfitting to a small test set, and study cross-sensor bias.
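To make the idea of a 3D evaluation metric concrete, here is a minimal sketch of volumetric intersection-over-union (IoU) between two 3D boxes. It is an illustration only, not the benchmark's evaluation code. It assumes axis-aligned boxes, whereas the dataset's boxes carry object orientations, and the names `Box3D` and `iou_3d` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box3D:
    """Hypothetical axis-aligned 3D box given by its min and max corners."""
    xmin: float
    ymin: float
    zmin: float
    xmax: float
    ymax: float
    zmax: float

    def volume(self) -> float:
        # Clamp each side length at zero so malformed boxes have zero volume.
        return (max(0.0, self.xmax - self.xmin)
                * max(0.0, self.ymax - self.ymin)
                * max(0.0, self.zmax - self.zmin))


def iou_3d(a: Box3D, b: Box3D) -> float:
    """Volumetric IoU of two axis-aligned boxes (assumption: no rotation)."""
    # Overlap length along each axis; non-positive means no overlap.
    dx = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    dy = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    dz = min(a.zmax, b.zmax) - max(a.zmin, b.zmin)
    if dx <= 0 or dy <= 0 or dz <= 0:
        return 0.0
    intersection = dx * dy * dz
    union = a.volume() + b.volume() - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = Box3D(0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
    ground_truth = Box3D(1.0, 0.0, 0.0, 3.0, 2.0, 2.0)
    print(f"IoU = {iou_3d(predicted, ground_truth):.3f}")
```

Handling oriented boxes would additionally require intersecting rotated footprints, which this sketch deliberately leaves out.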
