InternGPT: Solving Vision-Centric Tasks by Interacting with Chatbots Beyond Language
Jifeng Dai, Yi Wang, Yali Wang, Xizhou Zhu, Yang Yang, Zhaoyang Liu, Wenhai Wang, Qing-Long Zhang, Ping Luo, Shoufa Chen, Kunchang Li, Jiashuo Yu, Zhe Chen, Yu Qiao, Weiyun Wang, Yinan He, Qingyun Li, Limin Wang, Xuecheng Yang
[1] Yi Wang, et al. VideoChat: Chat-Centric Video Understanding, 2023, arXiv.
[2] Mohamed Elhoseiny, et al. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models, 2023, arXiv.
[3] Yong Jae Lee, et al. Visual Instruction Tuning, 2023, arXiv.
[4] Minghao Li, et al. API-Bank: A Benchmark for Tool-Augmented LLMs, 2023, arXiv.
[5] Michael G. Rabbat, et al. DINOv2: Learning Robust Visual Features without Supervision, 2023, Trans. Mach. Learn. Res.
[6] Junchi Yan, et al. H2RBox-v2: Boosting HBox-supervised Oriented Object Detection via Symmetric Learning, 2023, arXiv.
[7] Xu Tan, et al. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace, 2023, arXiv.
[8] P. Luo, et al. DDP: Diffusion Model for Dense Visual Prediction, 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV).
[9] Yi Wang, et al. VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[10] Chenfei Wu, et al. TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs, 2023, Intelligent Computing.
[11] Yi Wang, et al. Unmasked Teacher: Towards Training-Efficient Video Foundation Models, 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV).
[12] Faisal Ahmed, et al. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action, 2023, arXiv.
[13] Jun-Juan Zhu, et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, 2023, ECCV.
[14] Chenfei Wu, et al. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models, 2023, arXiv.
[15] Naman Goyal, et al. LLaMA: Open and Efficient Foundation Language Models, 2023, arXiv.
[16] Maneesh Agrawala, et al. Adding Conditional Control to Text-to-Image Diffusion Models, 2023, arXiv.
[17] Quoc V. Le, et al. The Flan Collection: Designing Data and Methods for Effective Instruction Tuning, 2023, ICML.
[18] S. Savarese, et al. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, 2023, arXiv.
[19] Limin Wang, et al. BasicTAD: An Astounding RGB-Only Baseline for Temporal Action Detection, 2022, Comput. Vis. Image Underst.
[20] Xi Victoria Lin, et al. OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization, 2022, arXiv.
[21] Noah A. Smith, et al. Self-Instruct: Aligning Language Models with Self-Generated Instructions, 2022, arXiv.
[22] Yi Wang, et al. InternVideo: General Video Foundation Models via Generative and Discriminative Learning, 2022, arXiv.
[23] Jong Wook Kim, et al. Robust Speech Recognition via Large-Scale Weak Supervision, 2022, ICML.
[24] Jamie Callan, et al. PAL: Program-aided Language Models, 2022, ICML.
[25] Kunchang Li, et al. InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges, 2022, arXiv.
[26] Kunchang Li, et al. UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer, 2022, arXiv.
[27] Humphrey Shi, et al. OneFormer: One Transformer to Rule Universal Image Segmentation, 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[28] Hongsheng Li, et al. InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions, 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[29] Andrew M. Dai, et al. Scaling Instruction-Finetuned Language Models, 2022, arXiv.
[30] P. Zhang, et al. GLM-130B: An Open Bilingual Pre-trained Model, 2022, ICLR.
[31] S. Gu, et al. Large Language Models are Zero-Shot Reasoners, 2022, NeurIPS.
[32] Jifeng Dai, et al. Vision Transformer Adapter for Dense Predictions, 2022, ICLR.
[33] Xi Victoria Lin, et al. OPT: Open Pre-trained Transformer Language Models, 2022, arXiv.
[34] Noah A. Smith, et al. Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks, 2022, EMNLP.
[35] Limin Wang, et al. VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training, 2022, NeurIPS.
[36] D. Schuurmans, et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models, 2022, ICLR.
[37] Ryan J. Lowe, et al. Training Language Models to Follow Instructions with Human Feedback, 2022, NeurIPS.
[38] S. Hoi, et al. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, 2022, ICML.
[39] Dale Schuurmans, et al. Chain of Thought Prompting Elicits Reasoning in Large Language Models, 2022, NeurIPS.
[40] Yali Wang, et al. UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning, 2022, ICLR.
[41] Trevor Darrell, et al. A ConvNet for the 2020s, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[42] B. Ommer, et al. High-Resolution Image Synthesis with Latent Diffusion Models, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[43] Liunian Harold Li, et al. Grounded Language-Image Pre-training, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[44] Anima Anandkumar, et al. Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[45] P. Luo, et al. PVT v2: Improved Baselines with Pyramid Vision Transformer, 2021, Computational Visual Media.
[46] Jeff Wu, et al. WebGPT: Browser-Assisted Question-Answering with Human Feedback, 2021, arXiv.
[47] Alexander G. Schwing, et al. Per-Pixel Classification is Not All You Need for Semantic Segmentation, 2021, NeurIPS.
[48] Quoc V. Le, et al. CoAtNet: Marrying Convolution and Attention for All Data Sizes, 2021, NeurIPS.
[49] Cordelia Schmid, et al. ViViT: A Video Vision Transformer, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[50] Fengwei Yu, et al. Incorporating Convolution Designs into Visual Transformers, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[51] Ari S. Morcos, et al. ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases, 2021, ICML.
[52] Xiang Li, et al. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[53] Heng Wang, et al. Is Space-Time Attention All You Need for Video Understanding?, 2021, ICML.
[54] Junchi Yan, et al. Rethinking Rotated Object Detection with Gaussian Wasserstein Distance Loss, 2021, ICML.
[55] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.
[56] Bin Li, et al. Deformable DETR: Deformable Transformers for End-to-End Object Detection, 2020, ICLR.
[57] Junchi Yan, et al. R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object, 2019, AAAI.
[58] Stephen Lin, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[59] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[60] Nicolas Usunier, et al. End-to-End Object Detection with Transformers, 2020, ECCV.
[61] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.
[62] Ross B. Girshick, et al. Focal Loss for Dense Object Detection, 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[63] Jitendra Malik, et al. SlowFast Networks for Video Recognition, 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[64] Chuang Gan, et al. TSM: Temporal Shift Module for Efficient Video Understanding, 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[65] Yue Zhang, et al. SCRDet: Towards More Robust Detection for Small, Cluttered and Rotated Objects, 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[66] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[67] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training, 2018.
[68] Kaiming He, et al. Focal Loss for Dense Object Detection, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[69] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NeurIPS.
[70] Luc Van Gool, et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, 2016, ECCV.
[71] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[72] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[73] Dumitru Erhan, et al. Going Deeper with Convolutions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[74] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.
[75] Geoffrey E. Hinton, et al. ImageNet Classification with Deep Convolutional Neural Networks, 2012, Commun. ACM.
[76] Fei-Fei Li, et al. ImageNet: A Large-Scale Hierarchical Image Database, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.