SemanticPaint

We present a new interactive and online approach to 3D scene understanding. Our system, SemanticPaint, allows users to scan their environment whilst simultaneously segmenting the scene interactively, simply by reaching out and touching any desired object or surface. The system continuously learns from these segmentations and labels new, unseen parts of the environment. Unlike offline systems, where capture, labeling, and batch learning often take hours or even days to perform, our approach is fully online. This gives users continuous live feedback on the recognition during capture, allowing them to immediately correct errors in the segmentation and/or learning, a feature that has so far been unavailable to batch and offline methods. The resulting models are tailored, or personalized, specifically to the user's environments and object classes of interest, opening up the potential for new applications in augmented reality, interior design, and human/robot navigation. The approach also provides the ability to capture substantial labeled 3D datasets for training large-scale visual recognition systems.
