Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding
[1] S. Savarese, et al. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, 2023, ArXiv.
[2] Weihua Luo, et al. HAM: Hierarchical Attention Model with High Performance for 3D Visual Grounding, 2022, ArXiv.
[3] Jing Zhang, et al. Toward Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline, 2022, IEEE Transactions on Circuits and Systems for Video Technology.
[4] Jing Zhang, et al. 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[5] Oriol Vinyals, et al. Flamingo: a Visual Language Model for Few-Shot Learning, 2022, NeurIPS.
[6] Weidong (Tom) Cai, et al. Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds, 2022, IJCAI.
[7] Haibing Ren, et al. 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Jiaya Jia, et al. Multi-View Transformer for 3D Visual Grounding, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Yu-Gang Jiang, et al. MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes, 2022, ECCV.
[10] Trishul M. Chilimbi, et al. Vision-Language Pre-Training with Triple Contrastive Learning, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[11] S. Hoi, et al. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, 2022, ICML.
[12] M. Kawanabe, et al. ScanQA: 3D Question Answering for Spatial Scene Understanding, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[13] Katerina Fragkiadaki, et al. Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds, 2021, ECCV.
[14] Daniel Keysers, et al. LiT: Zero-Shot Transfer with Locked-image text Tuning, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[15] Dong Xu, et al. 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[16] Junnan Li, et al. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, 2021, NeurIPS.
[17] Songyang Zhang, et al. SAT: 2D Semantics Assisted Training for 3D Visual Grounding, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[18] Hwann-Tzong Chen, et al. Text-Guided Graph Neural Networks for Referring 3D Instance Segmentation, 2021, AAAI.
[19] Liang Zhang, et al. Free-form Description Guided 3D Visual Graph Network for Object Grounding in Point Cloud, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[20] Ruimao Zhang, et al. InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[21] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[22] Quoc V. Le, et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, 2021, ICML.
[23] Jing Yu Koh, et al. Cross-Modal Contrastive Learning for Text-to-Image Generation, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[24] Angel X. Chang, et al. Scan2Cap: Context-aware Dense Captioning in RGB-D Scans, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[25] Bin Li, et al. Deformable DETR: Deformable Transformers for End-to-End Object Detection, 2020, ICLR.
[26] Mohammed Bennamoun, et al. Deep Learning for 3D Point Clouds: A Survey, 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[27] Angel X. Chang, et al. D3Net: A Speaker-Listener Architecture for Semi-supervised Dense Captioning and Visual Grounding in RGB-D Scans, 2021, ArXiv.
[28] Ahmed Abdelreheem, et al. ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes, 2020, ECCV.
[29] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[30] Jianfeng Gao, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, 2020, ECCV.
[31] Angel X. Chang, et al. ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language, 2019, ECCV.
[32] Zhaohui Zheng, et al. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression, 2019, AAAI.
[33] Kevin Gimpel, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.
[34] Yu Cheng, et al. UNITER: UNiversal Image-TExt Representation Learning, 2019, ECCV.
[35] Furu Wei, et al. VL-BERT: Pre-training of Generic Visual-Linguistic Representations, 2019, ICLR.
[36] Mohit Bansal, et al. LXMERT: Learning Cross-Modality Encoder Representations from Transformers, 2019, EMNLP.
[37] Cho-Jui Hsieh, et al. VisualBERT: A Simple and Performant Baseline for Vision and Language, 2019, ArXiv.
[38] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[39] Geoffrey E. Hinton, et al. When Does Label Smoothing Help?, 2019, NeurIPS.
[40] Leonidas J. Guibas, et al. Deep Hough Voting for 3D Object Detection in Point Clouds, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[41] Hao Chen, et al. FCOS: Fully Convolutional One-Stage Object Detection, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[42] Frank Hutter, et al. Decoupled Weight Decay Regularization, 2017, ICLR.
[43] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[44] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[45] Matthias Nießner, et al. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[46] C. Lawrence Zitnick, et al. CIDEr: Consensus-based image description evaluation, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[47] Geoffrey E. Hinton, et al. Visualizing Data using t-SNE, 2008.
[48] Alon Lavie, et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, 2005, IEEvaluation@ACL.
[49] Jean Carletta, et al. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, ACL 2005.
[50] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL 2004.
[51] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.