Supervised Fine-tuning in turn Improves Visual Foundation Models

Image-text pretraining, as exemplified by CLIP, has dominated the training of vision foundation models in recent years. Subsequent efforts to introduce region-level visual learning into CLIP's pretraining face scalability challenges due to the lack of large-scale region-level datasets. Drawing inspiration from supervised fine-tuning (SFT) in natural language processing, such as instruction tuning, we explore the potential of fine-grained SFT for enhancing the generalization of vision foundation models after pretraining. We therefore propose a two-stage method, ViSFT (Vision SFT), to unleash the fine-grained knowledge of vision foundation models. In ViSFT, the vision foundation model is enhanced by performing visual joint learning on several in-domain tasks and is then evaluated on out-of-domain benchmarks. With lightweight ViSFT updates taking less than 2 days on 8 V100 GPUs, a vision transformer with over 4.4B parameters shows improvements across various out-of-domain benchmarks, covering both vision and vision-and-language scenarios.
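The second stage described above can be sketched in PyTorch: a pretrained backbone is kept frozen, trainable low-rank (LoRA-style) adapters are injected into it, and several lightweight in-domain task heads are optimized jointly by summing their losses. This is a minimal illustrative sketch under stated assumptions, not the paper's implementation; the toy backbone, task names, and dimensions are all hypothetical.

```python
# Illustrative sketch of ViSFT's stage 2 (joint fine-grained SFT).
# Stage 1, the usual large-scale pretraining, is assumed done and not shown.
# All module names, task names, and sizes here are hypothetical.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # backbone weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapters start as a zero update

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x))

# A toy stand-in for a pretrained vision transformer backbone.
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
backbone[0] = LoRALinear(backbone[0])
backbone[2] = LoRALinear(backbone[2])

# One lightweight head per in-domain task (the paper uses tasks such as
# detection, segmentation, and captioning; plain linear probes stand in here).
heads = nn.ModuleDict({
    "detection": nn.Linear(64, 4),
    "segmentation": nn.Linear(64, 10),
    "captioning": nn.Linear(64, 100),
})

trainable = [p for p in backbone.parameters() if p.requires_grad]
trainable += list(heads.parameters())
opt = torch.optim.AdamW(trainable, lr=1e-3)

# One joint training step: per-task losses on shared backbone features.
x = torch.randn(8, 32)
feats = backbone(x)
loss = sum(
    nn.functional.cross_entropy(
        heads[t](feats), torch.randint(0, heads[t].out_features, (8,))
    )
    for t in heads
)
opt.zero_grad()
loss.backward()
opt.step()
```

Because only the low-rank adapters and the small heads receive gradients, the trainable parameter count stays tiny relative to the backbone, which is what makes updating a multi-billion-parameter transformer feasible on modest hardware.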
