CAD-Estate: Large-scale CAD Model Annotation in RGB Videos

We propose a method for annotating videos of complex multi-object scenes with a globally-consistent 3D representation of the objects. We annotate each object with a CAD model from a database, and place it in the 3D coordinate frame of the scene with a 9-DoF pose transformation. Our method is semi-automatic and works on commonly-available RGB videos, without requiring a depth sensor. Many steps are performed automatically, and the tasks performed by humans are simple, well-specified, and require only limited reasoning in 3D. This makes them feasible for crowd-sourcing and has allowed us to construct a large-scale dataset by annotating real-estate videos from YouTube. Our dataset CAD-Estate offers 108K instances of 12K unique CAD models placed in the 3D representations of 21K videos. In comparison to Scan2CAD, the largest existing dataset with CAD model annotations on real scenes, CAD-Estate has 8x more instances and 4x more unique CAD models. We showcase the benefits of pre-training a Mask2CAD model on CAD-Estate for the task of automatic 3D object reconstruction and pose estimation, demonstrating that it leads to improvements on the popular Scan2CAD benchmark. We will release the data by mid July 2023.

[1]  Sergey Zakharov,et al.  Photo-realistic Neural Domain Randomization , 2022, ECCV.

[2]  V. Ferrari,et al.  RayTran: 3D pose estimation and shape reconstruction of multiple objects from videos with ray-traced transformers , 2022, ECCV.

[3]  Leon L. Xu,et al.  ABO: Dataset and Benchmarks for Real-World 3D Object Understanding , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Patrick Labatut,et al.  Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[5]  Ian Reid,et al.  ODAM: Object Detection, Association, and Mapping using Posed RGB Video , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[6]  Anelia Angelova,et al.  Patch2CAD: Patchwise Embedding Learning for In-the-Wild Shape Retrieval from a Single Image , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[7]  Anton Konushin,et al.  ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection , 2021, 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).

[8]  Vladlen Koltun,et al.  Enhancing Photorealism Enhancement , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Matthias Grundmann,et al.  Objectron: A Large Scale Dataset of Object-Centric Videos in the Wild with Pose Annotations , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Ian Reid,et al.  MOLTR: Multiple Object Localization, Tracking and Reconstruction From Monocular RGB Videos , 2020, IEEE Robotics and Automation Letters.

[11]  Vittorio Ferrari,et al.  Vid2CAD: CAD Model Alignment Using Multi-View Constraints From Videos , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Peng Liu,et al.  3D-FRONT: 3D Furnished Rooms with layOuts and semaNTics , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[13]  Mike Roberts,et al.  Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[14]  Lin Gao,et al.  3D-FUTURE: 3D Furniture Shape with TextURE , 2020, International Journal of Computer Vision.

[15]  Angela Dai,et al.  Mask2CAD: 3D Shape Prediction by Learning to Segment and Retrieve , 2020, ECCV.

[16]  Lourdes Agapito,et al.  FroDO: From Detections to 3D Objects , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Pablo Bauszat,et al.  CoReNet: Coherent 3D scene reconstruction from a single RGB image , 2020, ECCV.

[18]  Quoc V. Le,et al.  SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Michael Goesele,et al.  The Replica Dataset: A Digital Replica of Indoor Spaces , 2019, ArXiv.

[20]  Jitendra Malik,et al.  Mesh R-CNN , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Laura Leal-Taixé,et al.  Tracking Without Bells and Whistles , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  Zhen Li,et al.  Graph-RISE: Graph-Regularized Image Semantic Embedding , 2019, ArXiv.

[23]  Marc Alexa,et al.  ABC: A Big CAD Model Dataset for Geometric Deep Learning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Sebastian Nowozin,et al.  Occupancy Networks: Learning 3D Reconstruction in Function Space , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Matthias Nießner,et al.  Scan2CAD: Learning CAD Model Alignment in RGB-D Scans , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Michael Felsberg,et al.  ATOM: Accurate Tracking by Overlap Maximization , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Nan Hua,et al.  Universal Sentence Encoder for English , 2018, EMNLP.

[28]  Qi-Xing Huang,et al.  Domain Transfer Through Deep Activation Matching , 2018, ECCV.

[29]  Wenbin Li,et al.  InteriorNet: Mega-scale Multi-sensor Photo-realistic Indoor Scenes Dataset , 2018, BMVC.

[30]  Kate Saenko,et al.  Syn2Real: A New Benchmark forSynthetic-to-Real Visual Domain Adaptation , 2018, ArXiv.

[31]  John Flynn,et al.  Stereo magnification , 2018, ACM Trans. Graph..

[32]  Stefan Roth,et al.  Matryoshka Networks: Predicting 3D Geometry via Nested Shape Layers , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Varun Jampani,et al.  Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[34]  Jun Li,et al.  Im2Struct: Recovering 3D Shape Structure from a Single RGB Image , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[35]  Jiajun Wu,et al.  Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36]  Wei Liu,et al.  Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images , 2018, ECCV.

[37]  Matthias Nießner,et al.  Matterport3D: Learning from RGB-D Data in Indoor Environments , 2017, 2017 International Conference on 3D Vision (3DV).

[38]  Frank Keller,et al.  Extreme Clicking for Efficient Object Annotation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[39]  Matthias Nießner,et al.  ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Thomas A. Funkhouser,et al.  Semantic Scene Completion from a Single Depth Image , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Vladlen Koltun,et al.  Playing for Data: Ground Truth from Computer Games , 2016, ECCV.

[42]  Silvio Savarese,et al.  3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction , 2016, ECCV.

[43]  Abhinav Gupta,et al.  Learning a Predictable and Generative Vector Representation for Objects , 2016, ECCV.

[44]  Leonidas J. Guibas,et al.  ShapeNet: An Information-Rich 3D Model Repository , 2015, ArXiv.

[45]  Jianxiong Xiao,et al.  SUN RGB-D: A RGB-D scene understanding benchmark suite , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[47]  Silvio Savarese,et al.  Beyond PASCAL: A benchmark for 3D object detection in the wild , 2014, IEEE Winter Conference on Applications of Computer Vision.

[48]  Andrew Zisserman,et al.  Detecting People Looking at Each Other in Videos , 2014, International Journal of Computer Vision.

[49]  Antonio Torralba,et al.  Parsing IKEA Objects: Fine Pose Estimation , 2013, 2013 IEEE International Conference on Computer Vision.

[50]  Stefan Roth,et al.  People-tracking-by-detection and people-detection-by-tracking , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[51]  Mark A. Livingston,et al.  Camera Control in Three Dimensions with a Two-Dimensional Input Device , 2000, J. Graphics, GPU, & Game Tools.