HAM: Hierarchical Attention Model with High Performance for 3D Visual Grounding
Weihua Luo | Xiaolin Wei | Wei Zhang | Lin Ma | Jiaming Chen
[1] Xinchao Wang,et al. Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning , 2022, NeurIPS.
[2] Weihua Luo,et al. A Circular Window-based Cascade Transformer for Online Action Detection , 2022, ArXiv.
[3] Wei Zhang,et al. Explore Inter-contrast between Videos via Composition for Weakly Supervised Temporal Sentence Grounding , 2022, AAAI.
[4] Jing Zhang,et al. 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[5] Jun Liu,et al. Surface Representation for Point Clouds , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[6] Haibing Ren,et al. 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Jiaya Jia,et al. Multi-View Transformer for 3D Visual Grounding , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Yuxin Chen,et al. Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Trishul M. Chilimbi,et al. Vision-Language Pre-Training with Triple Contrastive Learning , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[10] B. Ommer,et al. High-Resolution Image Synthesis with Latent Diffusion Models , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[11] Katerina Fragkiadaki,et al. Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds , 2021, ECCV.
[12] Shenghua Gao,et al. SVIP: Sequence VerIfication for Procedures in Videos , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[13] D. Rukhovich,et al. FCAF3D: Fully Convolutional Anchor-Free 3D Object Detection , 2021, ECCV.
[14] Bryan A. Plummer,et al. Revisiting Image-Language Networks for Open-Ended Phrase Detection , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[15] Dong Xu,et al. 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[16] Junyu Luo,et al. TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding , 2021, ACM Multimedia.
[17] Ali Farhadi,et al. LanguageRefer: Spatial-Language Model for 3D Visual Grounding , 2021, CoRL.
[18] Songyang Zhang,et al. SAT: 2D Semantics Assisted Training for 3D Visual Grounding , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[19] Hwann-Tzong Chen,et al. Text-Guided Graph Neural Networks for Referring 3D Instance Segmentation , 2021, AAAI.
[20] Chunhua Shen,et al. Twins: Revisiting the Design of Spatial Attention in Vision Transformers , 2021, NeurIPS.
[21] Yann LeCun,et al. MDETR - Modulated Detection for End-to-End Multi-Modal Understanding , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[22] Wengang Zhou,et al. TransVG: End-to-End Visual Grounding with Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[23] Shenghua Gao,et al. Look Before You Leap: Learning Landmark Features for One-Stage Visual Grounding , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[24] Zheng Zhang,et al. Group-Free 3D Object Detection via Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[25] Liang Zhang,et al. Free-form Description Guided 3D Visual Graph Network for Object Grounding in Point Cloud , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[26] Yizhou Yu,et al. Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images , 2021, Computer Vision and Pattern Recognition.
[27] Ruimao Zhang,et al. InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[28] Ilya Sutskever,et al. Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.
[29] Alec Radford,et al. Zero-Shot Text-to-Image Generation , 2021, ICML.
[30] Angel X. Chang,et al. Scan2Cap: Context-aware Dense Captioning in RGB-D Scans , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[31] S. Gelly,et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.
[32] Angel X. Chang,et al. D3Net: A Speaker-Listener Architecture for Semi-supervised Dense Captioning and Visual Grounding in RGB-D Scans , 2021, ArXiv.
[33] Katerina Fragkiadaki,et al. Looking Outside the Box to Ground Language in 3D Scenes , 2021.
[34] Stephen Lin,et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[35] Ankit Goyal,et al. Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D , 2020, NeurIPS.
[36] Ahmed Abdelreheem,et al. ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes , 2020, ECCV.
[37] Jiebo Luo,et al. Improving One-stage Visual Grounding by Recursive Sub-query Construction , 2020, ECCV.
[38] Anton van den Hengel,et al. Object-and-Action Aware Model for Visual Language Navigation , 2020, ECCV.
[39] Yanan Sun,et al. 3DSSD: Point-Based 3D Single Stage Object Detector , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[40] Angel X. Chang,et al. ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language , 2019, ECCV.
[41] Lin Ma,et al. Reconstruct and Represent Video Contents for Captioning via Reinforcement Learning , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[42] Thomas Wolf,et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter , 2019, ArXiv.
[43] Jiebo Luo,et al. A Fast and Accurate One-Stage Approach to Visual Grounding , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[44] Omer Levy,et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.
[45] Leonidas J. Guibas,et al. Deep Hough Voting for 3D Object Detection in Point Clouds , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[46] José M. F. Moura,et al. CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog , 2019, NAACL.
[47] Lianli Gao,et al. Neighbourhood Watch: Referring Expression Comprehension via Language-Guided Graph Attention Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[48] Xiaogang Wang,et al. PointRCNN: 3D Object Proposal Generation and Detection From Point Cloud , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[49] Frank Hutter,et al. Decoupled Weight Decay Regularization , 2017, ICLR.
[50] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[51] Yash Goyal,et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering , 2016, International Journal of Computer Vision.
[52] Bin Yang,et al. Deep Continuous Fusion for Multi-sensor 3D Object Detection , 2018, ECCV.
[53] Bin Xu,et al. Multi-level Fusion Based 3D Object Detection from Monocular Images , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[54] Bin Yang,et al. PIXOR: Real-time 3D Object Detection from Point Clouds , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[55] Licheng Yu,et al. MAttNet: Modular Attention Network for Referring Expression Comprehension , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[56] Steven Lake Waslander,et al. Joint 3D Proposal Generation and Object Detection from View Aggregation , 2017, 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[57] Stefan Lee,et al. Embodied Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
[58] Leonidas J. Guibas,et al. Frustum PointNets for 3D Object Detection from RGB-D Data , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[59] Yin Zhou,et al. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[60] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[61] Leonidas J. Guibas,et al. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space , 2017, NIPS.
[62] Matthias Nießner,et al. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[63] José M. F. Moura,et al. Visual Dialog , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[64] Ji Wan,et al. Multi-view 3D Object Detection Network for Autonomous Driving , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[65] Licheng Yu,et al. Modeling Context in Referring Expressions , 2016, ECCV.
[66] Silvio Savarese,et al. 3D Semantic Parsing of Large-Scale Indoor Spaces , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[67] Jianxiong Xiao,et al. Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[68] Alan L. Yuille,et al. Generation and Comprehension of Unambiguous Object Descriptions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[69] Jianxiong Xiao,et al. SUN RGB-D: A RGB-D scene understanding benchmark suite , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[70] Svetlana Lazebnik,et al. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , 2015, International Journal of Computer Vision.
[71] Margaret Mitchell,et al. VQA: Visual Question Answering , 2015, International Journal of Computer Vision.
[72] Yoshua Bengio,et al. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.
[73] Jeffrey Pennington,et al. GloVe: Global Vectors for Word Representation , 2014, EMNLP.
[74] Vicente Ordonez,et al. ReferItGame: Referring to Objects in Photographs of Natural Scenes , 2014, EMNLP.
[75] Yoshua Bengio,et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.
[76] Peter Young,et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.
[77] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.