Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
Shuming Shi | Zhaopeng Tu | Longyue Wang | Bingshuai Liu | Xinting Huang | Zefeng Du | Chenyang Lyu | Minghao Wu