Paper Information - Open-Vocabulary Point-Cloud Object Detection without 3D Annotation

The goal of open-vocabulary detection is to identify novel objects based on arbitrary textual descriptions. In this paper, we address open-vocabulary 3D point-cloud detection with a divide-and-conquer strategy, which involves: 1) developing a point-cloud detector that can learn a general representation for localizing various objects, and 2) connecting textual and point-cloud representations to enable the detector to classify novel object categories based on text prompting. Specifically, we resort to rich image pre-trained models, by which the point-cloud detector learns to localize objects under the supervision of 2D bounding boxes predicted by 2D pre-trained detectors. Moreover, we propose a novel de-biased triplet cross-modal contrastive learning method to connect the image, point-cloud, and text modalities, thereby enabling the point-cloud detector to benefit from vision-language pre-trained models, i.e., CLIP. This novel use of image and vision-language pre-trained models for point-cloud detectors allows for open-vocabulary 3D object detection without the need for 3D annotations. Experiments demonstrate that the proposed method improves over a wide range of baselines by at least 3.03 points on ScanNet and 7.47 points on SUN RGB-D. Furthermore, we provide a comprehensive analysis to explain why our approach works.
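The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of what a triplet cross-modal contrastive loss could look like, not the authors' implementation. It assumes that each detected object has one point-cloud, one image, and one text embedding, that the loss sums a point-to-image and a point-to-text InfoNCE term, and that "de-biased" means skipping negatives that share the anchor's pseudo-label. The function names, the temperature value, and the `labels` argument are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def masked_info_nce(anchors, targets, labels, temperature=0.1):
    """InfoNCE from anchors[i] to targets[i].

    Assumption: a target j != i with the same pseudo-label as anchor i is
    treated as a likely false negative and left out of the denominator.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = []
        positive = None
        for j, target in enumerate(targets):
            if j != i and labels[j] == labels[i]:
                continue  # skip same-label negatives (the assumed de-biasing step)
            logit = cosine(anchor, target) / temperature
            if j == i:
                positive = logit
            logits.append(logit)
        # Numerically stable log-sum-exp over the kept logits.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - positive
    return total / len(anchors)


def triplet_contrastive_loss(point_feats, image_feats, text_feats, labels,
                             temperature=0.1):
    """Pull each point-cloud feature toward its image and its text feature."""
    return (masked_info_nce(point_feats, image_feats, labels, temperature)
            + masked_info_nce(point_feats, text_feats, labels, temperature))


# Toy example: two objects with different pseudo-labels and 2-D features.
points = [[1.0, 0.0], [0.0, 1.0]]
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned = triplet_contrastive_loss(points, images, texts, labels=[0, 1])
shuffled = triplet_contrastive_loss(points, images[::-1], texts, labels=[0, 1])
print(aligned < shuffled)
```

In this toy setup the loss is small when the three modalities agree and grows when the image features are swapped between objects. In a real pipeline the image and text features would presumably come from a frozen vision-language model such as CLIP, with only the point-cloud branch being trained; that detail is also an assumption here.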
