Retargetable AR: Context-aware Augmented Reality in Indoor Scenes based on 3D Scene Graph

We present Retargetable AR, a novel AR framework that makes AR experiences aware of the scene context of the real environment in which they are deployed, enabling natural interaction between the virtual and real worlds. We characterize scene context by the relationships among objects in 3D space. Both the context assumed by AR content and the context formed by the real environment in which users experience AR are represented as abstract graphs, i.e., scene graphs. From RGB-D streams, our framework generates a volumetric map that integrates the geometric and semantic information of a scene. Using this semantic map, we abstract scene objects as oriented bounding boxes and estimate their orientations. Our framework then constructs, in an online fashion, a 3D scene graph characterizing the context of the real environment. The correspondence between the constructed graph and an AR scene graph denoting the context of the AR content provides a semantically registered content arrangement, which facilitates natural interaction between the virtual and real worlds. We extensively evaluated our prototype system through a quantitative evaluation of the oriented bounding box estimation, a subjective evaluation of the AR content arrangement based on the constructed 3D scene graphs, and an online AR demonstration.
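
To make the graph-correspondence idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes axis-aligned boxes as a stand-in for the oriented bounding boxes, two hypothetical relations ("on" and "near") with made-up thresholds, and a brute-force search in place of whatever matching procedure the framework actually uses. All names (`Box`, `relation`, `build_scene_graph`, `match`) are illustrative.

```python
from dataclasses import dataclass
from itertools import permutations
from math import hypot


@dataclass(frozen=True)
class Box:
    """Axis-aligned stand-in for an oriented bounding box (metres, z is up)."""
    label: str
    center: tuple  # (x, y, z)
    size: tuple    # full extents along x, y, z


def relation(a, b, tol=0.05, near_dist=1.0):
    """Return a hypothetical spatial relation from a to b, or None."""
    (ax, ay, az), (bx, by, bz) = a.center, b.center
    # Do the two footprints overlap in the horizontal plane?
    overlap = (abs(ax - bx) <= (a.size[0] + b.size[0]) / 2
               and abs(ay - by) <= (a.size[1] + b.size[1]) / 2)
    # "on": the bottom face of a rests (within tol) on the top face of b.
    if overlap and abs((az - a.size[2] / 2) - (bz + b.size[2] / 2)) <= tol:
        return "on"
    # "near": horizontal centre distance below an assumed threshold.
    if hypot(ax - bx, ay - by) <= near_dist:
        return "near"
    return None


def build_scene_graph(boxes):
    """Nodes are object labels; edges are (source, relation, target) index triples."""
    nodes = [b.label for b in boxes]
    edges = set()
    for i, a in enumerate(boxes):
        for j, b in enumerate(boxes):
            if i != j:
                rel = relation(a, b)
                if rel is not None:
                    edges.add((i, rel, j))
    return nodes, edges


def match(ar_nodes, ar_edges, nodes, edges):
    """Brute-force search for an injective, label- and relation-preserving
    mapping from AR graph nodes to scene graph nodes; None if there is none."""
    for cand in permutations(range(len(nodes)), len(ar_nodes)):
        labels_ok = all(nodes[s] == lab for s, lab in zip(cand, ar_nodes))
        edges_ok = all((cand[i], rel, cand[j]) in edges for i, rel, j in ar_edges)
        if labels_ok and edges_ok:
            return dict(enumerate(cand))
    return None


# Toy scene: a table, a cup resting on it, and a chair beside it.
scene = [
    Box("table", (0.0, 0.0, 0.40), (1.2, 0.8, 0.8)),
    Box("cup",   (0.2, 0.1, 0.85), (0.1, 0.1, 0.1)),
    Box("chair", (0.9, 0.0, 0.45), (0.5, 0.5, 0.9)),
]
nodes, edges = build_scene_graph(scene)

# AR content that assumes "a cup on a table" as its context.
placement = match(["cup", "table"], {(0, "on", 1)}, nodes, edges)
print(placement)  # {0: 1, 1: 0}: AR node 0 maps to the real cup, node 1 to the table
```

The brute-force search is exponential in the number of AR nodes and is only reasonable for very small graphs such as this one; it is used here purely to show what a correspondence between the two graphs means.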
