Curiosity-driven 3D Object Detection Without Labels

In this paper we set out to solve the task of 6-DOF 3D object detection from 2D images, where the only supervision is a geometric representation of the objects we aim to find. In doing so, we remove the need for 6-DOF labels (i.e. position, orientation etc.), allowing our network to be trained on unlabeled images in a self-supervised manner. We achieve this through a neural network which learns an explicit scene parameterization which is subsequently passed into a differentiable renderer. We analyze why analysis-by-synthesis-like losses for supervision of 3D scene structure using differentiable rendering is not practical, as it almost always gets stuck in local minima of visual ambiguities. This can be overcome by a novel form of training, where an additional network is employed to steer the optimization itself to explore the entire parameter space i.e. to be curious, and hence, to resolve those ambiguities and find workable minima.

[1]  Adrien Gaidon,et al.  Monocular Differentiable Rendering for Self-Supervised 3D Object Detection , 2020, ECCV.

[2]  Gordon Wetzstein,et al.  Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations , 2019, NeurIPS.

[3]  Michael J. Black,et al.  OpenDR: An Approximate Differentiable Renderer , 2014, ECCV.

[4]  Jana Kosecka,et al.  3D Bounding Box Estimation Using Deep Learning and Geometry , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[6]  Andreas Geiger,et al.  Differentiable Volumetric Rendering: Learning Implicit 3D Representations Without 3D Supervision , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Saining Xie,et al.  Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Jaakko Lehtinen,et al.  Differentiable Monte Carlo ray tracing through edge sampling , 2018, ACM Trans. Graph..

[9]  Yan Lu,et al.  Weakly Supervised 3D Object Detection from Point Clouds , 2020, ACM Multimedia.

[10]  Sanja Fidler,et al.  Monocular 3D Object Detection for Autonomous Driving , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Andrea Simonelli,et al.  Disentangling Monocular 3D Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[12]  Zoltan-Csaba Marton,et al.  Augmented Autoencoders: Implicit 3D Orientation Learning for 6D Object Detection , 2019, International Journal of Computer Vision.

[13]  Yi Zhou,et al.  On the Continuity of Rotation Representations in Neural Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Adrien Gaidon,et al.  Autolabeling 3D Objects With Differentiable Rendering of SDF Shape Priors , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Bin Tian,et al.  CenterNet3D: An Anchor free Object Detector for Autonomous Driving , 2020, ArXiv.

[16]  Tatsuya Harada,et al.  Neural 3D Mesh Renderer , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[17]  Matthias Zwicker,et al.  DRWR: A Differentiable Renderer without Rendering for Unsupervised 3D Structure Learning from Silhouette Images , 2020, ICML.

[18]  Alexander G. Schwing,et al.  3D Spatial Recognition without Spatially Labeled 3D , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Richard A. Newcombe,et al.  DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Gabriel J. Brostow,et al.  Digging Into Self-Supervised Monocular Depth Estimation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  Zizhang Wu,et al.  SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[23]  Jason Rambach,et al.  SynPo-Net—Accurate and Fast CNN-Based 6DoF Object Pose Estimation Using Synthetic Training , 2021, Sensors.

[24]  James M. Rehg,et al.  3D-RCNN: Instance-Level 3D Object Reconstruction via Render-and-Compare , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25]  Dragomir Anguelov,et al.  Scalability in Perception for Autonomous Driving: Waymo Open Dataset , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Pratul P. Srinivasan,et al.  NeRF , 2020, ECCV.

[27]  Timothy Patten,et al.  Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[28]  Marcus Vetter,et al.  EfficientPose: An efficient, accurate and scalable end-to-end 6D multi object pose estimation approach , 2020, ArXiv.

[29]  Nassir Navab,et al.  SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[30]  Pascal Fua,et al.  Real-Time Seamless Single Shot 6D Object Pose Prediction , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Hongzi Zhu,et al.  Monocular 3D Object Detection: An Extrinsic Parameter Free Approach , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Zengyi Qin,et al.  MonoGRNet: A General Framework for Monocular 3D Object Detection , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[34]  Tobias Ritschel,et al.  Escaping Plato’s Cave: 3D Shape From Adversarial Rendering , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[35]  Tobias Ritschel,et al.  Finding Your (3D) Center: 3D Object Detection Using a Learned Loss , 2020, ECCV.

[36]  Oisin Mac Aodha,et al.  Unsupervised Monocular Depth Estimation with Left-Right Consistency , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Qiang Xu,et al.  nuScenes: A Multimodal Dataset for Autonomous Driving , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Vittorio Ferrari,et al.  Learning Single-Image 3D Reconstruction by Generative Modelling of Shape, Pose and Shading , 2019, International Journal of Computer Vision.

[39]  Jaakko Lehtinen,et al.  Learning to Predict 3D Objects with an Interpolation-based Differentiable Renderer , 2019, NeurIPS.

[40]  Leonidas J. Guibas,et al.  Deep Hough Voting for 3D Object Detection in Point Clouds , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[41]  Jitendra Malik,et al.  Multi-view Supervision for Single-View Reconstruction via Differentiable Ray Consistency , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Vincent Lepetit,et al.  Model Based Training, Detection and Pose Estimation of Texture-Less 3D Objects in Heavily Cluttered Scenes , 2012, ACCV.

[43]  Karol Myszkowski,et al.  X-Fields , 2020, ACM Trans. Graph..

[44]  Vincent Lepetit,et al.  Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes , 2011, 2011 International Conference on Computer Vision.