Supervised Fine-tuning in turn Improves Visual Foundation Models
Yuying Ge, Xiaohu Jiang, Yixiao Ge, Ying Shan, Chun Yuan
[1] Tiejun Huang,et al. SVIT: Scaling up Visual Instruction Tuning , 2023, ArXiv.
[2] Lingpeng Kong,et al. M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning , 2023, ArXiv.
[3] Kai Chen,et al. MultiModal-GPT: A Vision and Language Model for Dialogue with Humans , 2023, ArXiv.
[4] Yong Jae Lee,et al. Visual Instruction Tuning , 2023, NeurIPS.
[5] Michael G. Rabbat,et al. DINOv2: Learning Robust Visual Features without Supervision , 2023, Trans. Mach. Learn. Res..
[6] Ledell Yu Wu,et al. EVA-CLIP: Improved Training Techniques for CLIP at Scale , 2023, ArXiv.
[7] Tiejun Huang,et al. EVA-02: A Visual Representation for Neon Genesis , 2023, Image and Vision Computing.
[8] Naman Goyal,et al. LLaMA: Open and Efficient Foundation Language Models , 2023, ArXiv.
[9] Quoc V. Le,et al. The Flan Collection: Designing Data and Methods for Effective Instruction Tuning , 2023, ICML.
[10] S. Savarese,et al. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models , 2023, ICML.
[11] Ying Shen,et al. MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning , 2022, ACL.
[12] Noah A. Smith,et al. Self-Instruct: Aligning Language Models with Self-Generated Instructions , 2022, ACL.
[13] Hongsheng Li,et al. Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks , 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[14] Alexei A. Efros,et al. InstructPix2Pix: Learning to Follow Image Editing Instructions , 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[15] Ledell Yu Wu,et al. EVA: Exploring the Limits of Masked Visual Representation Learning at Scale , 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Li Dong,et al. BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers , 2022, ArXiv.
[17] Xiaogang Wang,et al. Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs , 2022, NeurIPS.
[18] Zirui Wang,et al. CoCa: Contrastive Captioners are Image-Text Foundation Models , 2022, Trans. Mach. Learn. Res..
[19] Dan Xu,et al. Inverted Pyramid Multi-task Transformer for Dense Scene Understanding , 2022, ECCV.
[20] Lu Yuan,et al. RegionCLIP: Region-based Language-Image Pretraining , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Liunian Harold Li,et al. Grounded Language-Image Pre-training , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[22] A. Schwing,et al. Masked-attention Mask Transformer for Universal Image Segmentation , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[23] Ross B. Girshick,et al. Masked Autoencoders Are Scalable Vision Learners , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[24] Alexander M. Rush,et al. Multitask Prompted Training Enables Zero-Shot Task Generalization , 2021, ICLR.
[25] Quoc V. Le,et al. Finetuned Language Models Are Zero-Shot Learners , 2021, ICLR.
[26] Yelong Shen,et al. LoRA: Low-Rank Adaptation of Large Language Models , 2021, ICLR.
[27] Rowel Atienza,et al. Vision Transformer for Fast and Efficient Scene Text Recognition , 2021, ICDAR.
[28] Julien Mairal,et al. Emerging Properties in Self-Supervised Vision Transformers , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[29] Ilya Sutskever,et al. Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.
[30] Ronghang Hu,et al. UniT: Multimodal Multitask Learning with a Unified Transformer , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[31] Quoc V. Le,et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision , 2021, ICML.
[32] Noam M. Shazeer,et al. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity , 2021, J. Mach. Learn. Res..
[33] Matthieu Cord,et al. Training data-efficient image transformers & distillation through attention , 2020, ICML.
[34] S. Gelly,et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.
[35] Michael Crawshaw,et al. Multi-Task Learning with Deep Neural Networks: A Survey , 2020, ArXiv.
[36] Orhan Firat,et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding , 2020, ICLR.
[37] D. Song,et al. The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[38] Leonidas Guibas,et al. Robust Learning Through Cross-Task Consistency , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[39] Nicolas Usunier,et al. End-to-End Object Detection with Transformers , 2020, ECCV.
[40] Natalia Gimelshein,et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.
[41] Samyam Rajbhandari,et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models , 2019, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.
[42] Priyanka Agrawal,et al. OmniNet: A unified architecture for multi-modal multi-task learning , 2019, ArXiv.
[43] Dawn Song,et al. Natural Adversarial Examples , 2019, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[44] Ali Farhadi,et al. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[45] Jitendra Malik,et al. Which Tasks Should Be Learned Together in Multi-task Learning? , 2019, ICML.
[46] Eric P. Xing,et al. Learning Robust Global Representations by Penalizing Local Predictive Power , 2019, NeurIPS.
[47] Marcel Worring,et al. Many Task Learning With Task Routing , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[48] Christopher D. Manning,et al. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[49] Xiaodong Liu,et al. Multi-Task Deep Neural Networks for Natural Language Understanding , 2019, ACL.
[50] Thomas Wolf,et al. A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks , 2018, AAAI.
[51] Allan Jabri,et al. Learning Visually Grounded Sentence Representations , 2018, NAACL.
[52] Leonidas J. Guibas,et al. Taskonomy: Disentangling Task Transfer Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[53] Frank Hutter,et al. Decoupled Weight Decay Regularization , 2017, ICLR.
[54] Andreas Dengel,et al. EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification , 2017, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing.
[55] Chen Sun,et al. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[56] Lukasz Kaiser,et al. One Model To Learn Them All , 2017, ArXiv.
[57] Xuanjing Huang,et al. Adversarial Multi-task Learning for Text Classification , 2017, ACL.
[58] Ross B. Girshick,et al. Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[59] Yash Goyal,et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering , 2016, International Journal of Computer Vision.
[60] Yoshimasa Tsuruoka,et al. A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks , 2016, EMNLP.
[61] Anders Søgaard,et al. Deep multi-task learning with low level tasks supervised at lower layers , 2016, ACL.
[62] A. Vedaldi,et al. Synthetic Data for Text Localisation in Natural Images , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[63] Tianqi Chen,et al. Training Deep Nets with Sublinear Memory Cost , 2016, ArXiv.
[64] Svetlana Lazebnik,et al. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , 2015, International Journal of Computer Vision.
[65] Xinlei Chen,et al. Microsoft COCO Captions: Data Collection and Evaluation Server , 2015, ArXiv.
[66] Yoshua Bengio,et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.
[67] Andrew Zisserman,et al. Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition , 2014, ArXiv.
[68] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.
[69] Palaiahnakote Shivakumara,et al. Recognizing Text with Perspective Distortion in Natural Scenes , 2013, 2013 IEEE International Conference on Computer Vision.
[70] Jon Almazán,et al. ICDAR 2013 Robust Reading Competition , 2013, 2013 12th International Conference on Document Analysis and Recognition.
[71] Johannes Stallkamp,et al. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition , 2012, Neural Networks.
[72] Kai Wang,et al. End-to-end scene text recognition , 2011, 2011 International Conference on Computer Vision.
[73] C. V. Jawahar,et al. Scene Text Recognition using Higher Order Language Priors , 2012, BMVC.
[74] Fei-Fei Li,et al. ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.
[75] Pietro Perona,et al. Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.
[77] Rich Caruana,et al. Multitask Learning , 1997, Machine Learning.
[78] Hanrong Ye,et al. TaskPrompter: Spatial-Channel Multi-Task Prompting for Dense Scene Understanding , 2023, ICLR.
[79] Matthew E. Peters,et al. HINT: Hypernetwork Instruction Tuning for Efficient Zero-Shot Generalisation , 2022, ArXiv.
[80] Xinlei Chen,et al. nocaps: novel object captioning at scale , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[81] Alex Krizhevsky,et al. Learning Multiple Layers of Features from Tiny Images , 2009 .