Paper Information - ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes

A key requirement for leveraging supervised deep learning methods is the availability of large, labeled datasets. Unfortunately, in the context of RGB-D scene understanding, very little data is available – current datasets cover a small range of scene views and have limited semantic annotations. To address this issue, we introduce ScanNet, an RGB-D video dataset containing 2.5M views in 1513 scenes annotated with 3D camera poses, surface reconstructions, and semantic segmentations. To collect this data, we designed an easy-to-use and scalable RGB-D capture system that includes automated surface reconstruction and crowdsourced semantic annotation. We show that using this data helps achieve state-of-the-art performance on several 3D scene understanding tasks, including 3D object classification, semantic voxel labeling, and CAD model retrieval.
