ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst
Jing Liu | Zehuan Yuan | Zijia Zhao | Longteng Guo | Xinxin Zhu | Si-Qing Chen | Tongtian Yue | Shuai Shao
[1] Boyang Li, et al. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning, 2023, ArXiv.
[2] Kalyan Vasudev Alwala, et al. ImageBind: One Embedding Space to Bind Them All, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[3] Yu-Gang Jiang, et al. ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System, 2023, ArXiv.
[4] Ming Yan, et al. mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality, 2023, ArXiv.
[5] Yong Jae Lee, et al. Visual Instruction Tuning, 2023, ArXiv.
[6] Jing Liu, et al. VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset, 2023, ArXiv.
[7] Xu Tan, et al. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face, 2023, NeurIPS.
[8] Qiuqiang Kong, et al. WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research, 2023, ArXiv.
[9] Ledell Yu Wu, et al. EVA-CLIP: Improved Training Techniques for CLIP at Scale, 2023, ArXiv.
[10] Faisal Ahmed, et al. MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action, 2023, ArXiv.
[11] Henrique Pondé de Oliveira Pinto, et al. GPT-4 Technical Report, 2023, ArXiv abs/2303.08774.
[12] Chenfei Wu, et al. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models, 2023, ArXiv.
[13] Mehdi S. M. Sajjadi, et al. PaLM-E: An Embodied Multimodal Language Model, 2023, ICML.
[14] Naman Goyal, et al. LLaMA: Open and Efficient Foundation Language Models, 2023, ArXiv.
[15] S. Savarese, et al. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, 2023, ArXiv.
[16] Ying Shen, et al. MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning, 2022, ArXiv.
[17] Noah A. Smith, et al. Self-Instruct: Aligning Language Models with Self-Generated Instructions, 2022, ACL.
[18] Daniel C. Tompkins, et al. BEATs: Audio Pre-Training with Acoustic Tokenizers, 2022, ICML.
[19] Zhe Gan, et al. GRiT: A Generative Region-to-text Transformer for Object Understanding, 2022, ArXiv.
[20] Alexander M. Rush, et al. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model, 2022, ArXiv.
[21] Andrew M. Dai, et al. Scaling Instruction-Finetuned Language Models, 2022, ArXiv.
[22] P. Zhang, et al. GLM-130B: An Open Bilingual Pre-trained Model, 2022, ICLR.
[23] Haibin Ling, et al. Expanding Language-Image Pretrained Models for General Video Recognition, 2022, ECCV.
[24] Tu Minh Phuong, et al. Video Dialog as Conversation about Objects Living in Space-Time, 2022, ECCV.
[25] N. Codella, et al. i-Code: An Integrative and Composable Multimodal Learning Framework, 2022, AAAI.
[26] Xi Victoria Lin, et al. OPT: Open Pre-trained Transformer Language Models, 2022, ArXiv.
[27] Oriol Vinyals, et al. Flamingo: a Visual Language Model for Few-Shot Learning, 2022, NeurIPS.
[28] Andrew M. Dai, et al. PaLM: Scaling Language Modeling with Pathways, 2022, J. Mach. Learn. Res.
[29] Yapeng Tian, et al. Learning to Answer Questions in Dynamic Audio-Visual Scenarios, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[30] Ryan J. Lowe, et al. Training language models to follow instructions with human feedback, 2022, NeurIPS.
[31] Yanwen Guo, et al. Attention-based Dual Supervised Decoder for RGBD Semantic Segmentation, 2022, ArXiv.
[32] Federico Raue, et al. AudioCLIP: Extending CLIP to Image, Text and Audio, 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[33] Guillaume Allibert, et al. Transformer Fusion for Indoor RGB-D Semantic Segmentation, 2022, SSRN Electronic Journal.
[34] Jenia Jitsev, et al. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs, 2021, ArXiv.
[35] Le Song, et al. ProTo: Program-Guided Transformer for Program-Guided Tasks, 2021, NeurIPS.
[36] Junnan Li, et al. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, 2021, NeurIPS.
[37] Andrew Zisserman, et al. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[38] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[39] Radu Soricut, et al. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[40] Maja Pantic, et al. End-To-End Audio-Visual Speech Recognition with Conformers, 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[41] Junkun Chen, et al. Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation, 2021, ICML.
[42] Zheng-Jun Zha, et al. Learning to Discretely Compose Reasoning Module Networks for Video Captioning, 2020, IJCAI.
[43] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[44] Tuomas Virtanen, et al. Clotho: An Audio Captioning Dataset, 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[45] Yu Cheng, et al. UNITER: UNiversal Image-TExt Representation Learning, 2019, ECCV.
[46] Jason J. Corso, et al. Unified Vision-Language Pre-Training for Image Captioning and VQA, 2019, AAAI.
[47] Yong Yin, et al. MMFNet: A Multi-modality MRI Fusion Network for Segmentation of Nasopharyngeal Carcinoma, 2018, Neurocomputing.
[48] Cho-Jui Hsieh, et al. VisualBERT: A Simple and Performant Baseline for Vision and Language, 2019, ArXiv.
[49] Gunhee Kim, et al. AudioCaps: Generating Captions for Audios in the Wild, 2019, NAACL.
[50] Ali Farhadi, et al. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[51] Xin Wang, et al. VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[52] Christopher D. Manning, et al. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[53] Anoop Cherian, et al. Audio Visual Scene-Aware Dialog, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[54] Xinlei Chen, et al. nocaps: novel object captioning at scale, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[55] Radu Soricut, et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning, 2018, ACL.
[56] Lei Zhang, et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[57] Aren Jansen, et al. Audio Set: An ontology and human-labeled dataset for audio events, 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[58] Yash Goyal, et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering, 2016, International Journal of Computer Vision.
[59] Fei-Fei Li, et al. Deep visual-semantic alignments for generating image descriptions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[60] Michael S. Bernstein, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, 2016, International Journal of Computer Vision.
[61] Svetlana Lazebnik, et al. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models, 2015, International Journal of Computer Vision.
[62] C. Lawrence Zitnick, et al. CIDEr: Consensus-based image description evaluation, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[63] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.
[64] Vicente Ordonez, et al. Im2Text: Describing Images Using 1 Million Captioned Photographs, 2011, NIPS.
[65] William B. Dolan, et al. Collecting Highly Parallel Data for Paraphrase Evaluation, 2011, ACL.