Move to See Better: Towards Self-Supervised Amodal Object Detection

Humans learn to better understand the world by moving around their environment to get more informative viewpoints of the scene. Most methods for 2D visual recognition tasks such as object detection and segmentation treat images of the same scene as individual samples and do not exploit object permanence in multiple views. Generalization to novel scenes and views thus requires additional training with lots of human annotations. In this paper, we propose a self-supervised framework to improve an object detector in unseen scenarios by moving an agent around in a 3D environment and aggregating multi-view RGB-D information. We unproject confident 2D object detections from the pre-trained detector and perform unsupervised 3D segmentation on the point cloud. The segmented 3D objects are then re-projected to all other views to obtain pseudo-labels for fine-tuning. Experiments on both indoor and outdoor datasets show that (1) our framework performs high-quality 3D segmentation from raw RGB-D data and a pre-trained 2D detector; (2) fine-tuning with self-supervision improves the 2D detector significantly where an unseen RGB image is given as input at test time; (3) training a 3D detector with self-supervision outperforms a comparable self-supervised method by a large margin.

[1]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Katerina Fragkiadaki,et al.  Tracking Emerges by Looking Around Static Scenes, with Neural 3D Mapping , 2020, ECCV.

[3]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[4]  Kristen Grauman,et al.  Large-Scale Live Active Learning: Training Object Detectors with Crawled Data and Crowds , 2011, CVPR 2011.

[5]  Vladlen Koltun,et al.  Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials , 2011, NIPS.

[6]  Ruslan Salakhutdinov,et al.  Learning to Explore using Active Neural SLAM , 2020, ICLR.

[7]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[8]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Abhinav Gupta,et al.  Zero-Shot Recognition via Semantic Embeddings and Knowledge Graphs , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[10]  George Papandreou,et al.  Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation , 2018, ECCV.

[11]  Yin Zhou,et al.  VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[12]  Ulrich Neumann,et al.  SGPN: Similarity Group Proposal Network for 3D Point Cloud Instance Segmentation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[13]  Abhinav Gupta,et al.  The More You Know: Using Knowledge Graphs for Image Classification , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Ali Farhadi,et al.  SeGAN: Segmenting and Generating the Invisible , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[15]  Germán Ros,et al.  CARLA: An Open Urban Driving Simulator , 2017, CoRL.

[16]  Silvio Savarese,et al.  Scene Memory Transformer for Embodied Agents in Long-Horizon Tasks , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Juan D. Tardós,et al.  ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras , 2016, IEEE Transactions on Robotics.

[18]  Kristen Grauman,et al.  End-to-End Policy Learning for Active Visual Categorization , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Rahul Sukthankar,et al.  Cognitive Mapping and Planning for Visual Navigation , 2017, International Journal of Computer Vision.

[20]  Jitendra Malik,et al.  Learning Instance Segmentation by Interaction , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[21]  Jitendra Malik,et al.  Cost-Sensitive Active Learning for Intracranial Hemorrhage Detection , 2018, MICCAI.

[22]  Jana Kosecka,et al.  A dataset for developing and benchmarking active vision , 2017, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[23]  Burr Settles,et al.  Active Learning Literature Survey , 2009 .

[24]  Estevam R. Hruschka,et al.  Toward an Architecture for Never-Ending Language Learning , 2010, AAAI.

[25]  Leonidas J. Guibas,et al.  PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Matthias Nießner,et al.  Matterport3D: Learning from RGB-D Data in Indoor Environments , 2017, 2017 International Conference on 3D Vision (3DV).

[27]  Jitendra Malik,et al.  Gibson Env: Real-World Perception for Embodied Agents , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[28]  Linda B. Smith,et al.  Toddler-Inspired Visual Object Learning , 2018, NeurIPS.

[29]  Abhinav Gupta,et al.  Semantic Curiosity for Active Visual Learning , 2020, ECCV.

[30]  Leonidas J. Guibas,et al.  PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space , 2017, NIPS.

[31]  S. Edelman,et al.  Orientation dependence in the recognition of familiar and novel views of three-dimensional objects , 1992, Vision Research.

[32]  Xinlei Chen,et al.  Embodied Amodal Recognition: Learning to Move to Perceive Objects , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[33]  Jitendra Malik,et al.  Learning to See by Moving , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[34]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[35]  Xinlei Chen,et al.  NEIL: Extracting Visual Knowledge from Web Data , 2013, 2013 IEEE International Conference on Computer Vision.

[36]  Boris Katz,et al.  ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models , 2019, NeurIPS.

[37]  Bin Yang,et al.  SBNet: Sparse Blocks Network for Fast Inference , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[38]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Raquel Urtasun,et al.  Deep Parametric Continuous Convolutional Neural Networks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40]  Silvio Savarese,et al.  Active Learning for Convolutional Neural Networks: A Core-Set Approach , 2017, ICLR.

[41]  Sergio L. Netto,et al.  A Survey on Performance Metrics for Object-Detection Algorithms , 2020, 2020 International Conference on Systems, Signals and Image Processing (IWSSIP).

[42]  Nikolai Smolyanskiy,et al.  On the Importance of Stereo for Accurate Depth Estimation: An Efficient Semi-Supervised Deep Neural Network Approach , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[43]  Stefan Lee,et al.  Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition , 2018, CoRL.

[44]  Jitendra Malik,et al.  Amodal Instance Segmentation , 2016, ECCV.

[45]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[46]  Bence Nanay,et al.  The Importance of Amodal Completion in Everyday Perception , 2018, i-Perception.

[47]  Alan Yuille,et al.  Active Vision , 2014, Computer Vision, A Reference Guide.

[48]  Jitendra Malik,et al.  Habitat: A Platform for Embodied AI Research , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[49]  Stefan Lee,et al.  Embodied Question Answering , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[50]  Kilian Q. Weinberger,et al.  LDLS: 3-D Object Segmentation Through Label Diffusion From 2-D Images , 2019, IEEE Robotics and Automation Letters.

[51]  Bolei Zhou,et al.  Places: A 10 Million Image Database for Scene Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52]  Jordi Pont-Tuset,et al.  The Open Images Dataset V4 , 2018, International Journal of Computer Vision.

[53]  Michael Goesele,et al.  The Replica Dataset: A Digital Replica of Indoor Spaces , 2019, ArXiv.

[54]  M. Tarr Rotating objects to recognize them: A case study on the role of viewpoint dependency in the recognition of three-dimensional objects , 1995, Psychonomic bulletin & review.

[55]  Stefan Leutenegger,et al.  Pairwise Decomposition of Image Sequences for Active Multi-view Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Leonidas J. Guibas,et al.  Frustum PointNets for 3D Object Detection from RGB-D Data , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[57]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[58]  Kaiming He,et al.  Focal Loss for Dense Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[59]  Dana H. Ballard,et al.  Animate Vision , 1991, Artif. Intell..

[60]  Scott Cheng-Hsin Yang,et al.  Active sensing in the categorization of visual patterns , 2016, eLife.

[61]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..