ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes

A key requirement for leveraging supervised deep learning methods is the availability of large, labeled datasets. Unfortunately, in the context of RGB-D scene understanding, very little data is available – current datasets cover a small range of scene views and have limited semantic annotations. To address this issue, we introduce ScanNet, an RGB-D video dataset containing 2.5M views in 1513 scenes annotated with 3D camera poses, surface reconstructions, and semantic segmentations. To collect this data, we designed an easy-to-use and scalable RGB-D capture system that includes automated surface reconstruction and crowdsourced semantic annotation. We show that using this data helps achieve state-of-the-art performance on several 3D scene understanding tasks, including 3D object classification, semantic voxel labeling, and CAD model retrieval.
