InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists

Recent advances in generative diffusion models have enabled text-controlled synthesis of realistic and diverse images with impressive quality. Despite these remarkable advances, the application of text-to-image generative models in computer vision for standard visual recognition tasks remains limited. The current de facto approach for these tasks is to design model architectures and loss functions that are tailored to the task at hand. In this paper, we develop a unified language interface for computer vision tasks that abstracts away task-specific design choices and enables task execution by following natural language instructions. Our approach involves casting multiple computer vision tasks as text-to-image generation problems. Here, the text represents an instruction describing the task, and the resulting image is a visually encoded task output. To train our model, we pool commonly used computer vision datasets covering a range of tasks, including segmentation, object detection, depth estimation, and classification. We then use a large language model to paraphrase prompt templates that convey the specific tasks to be conducted on each image, and through this process, we create a multi-modal and multi-task training dataset comprising input and output images along with annotated instructions. Following the InstructPix2Pix architecture, we apply instruction-tuning to a text-to-image diffusion model using our constructed dataset, steering its functionality from a generative model to an instruction-guided multi-task vision learner. Experiments demonstrate that our model, dubbed InstructCV, performs competitively with other generalist and task-specific vision models. Moreover, it exhibits compelling generalization capabilities to unseen data, categories, and user instructions.
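To make the training recipe in the abstract concrete, below is a minimal, self-contained PyTorch sketch of an InstructPix2Pix-style objective: the denoiser receives a noised copy of the visually encoded task output together with the conditioning input image and an instruction embedding, and is trained to predict the added noise. This is an illustrative sketch, not the authors' implementation: TinyDenoiser, embed_instruction, and the toy tensors are hypothetical stand-ins for the latent-diffusion U-Net, the frozen text encoder, and the pooled multi-task dataset used in practice.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Stand-in for the denoising U-Net: it sees the noisy target image
    concatenated channel-wise with the conditioning input image, plus a
    broadcast timestep/instruction embedding, and predicts the noise."""
    def __init__(self, img_channels=3, cond_dim=32):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim + 1, 16)  # instruction emb + timestep
        self.net = nn.Sequential(
            nn.Conv2d(2 * img_channels + 16, 64, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, img_channels, 3, padding=1),
        )
    def forward(self, noisy_target, cond_image, text_emb, t):
        b, _, h, w = noisy_target.shape
        cond = self.cond_proj(torch.cat([text_emb, t[:, None]], dim=-1))
        cond = cond[:, :, None, None].expand(b, -1, h, w)
        x = torch.cat([noisy_target, cond_image, cond], dim=1)
        return self.net(x)

def embed_instruction(instruction: str, dim: int = 32) -> torch.Tensor:
    """Toy stand-in for a frozen text encoder (e.g., CLIP in InstructPix2Pix)."""
    g = torch.Generator().manual_seed(hash(instruction) % (2 ** 31))
    return torch.randn(1, dim, generator=g)

# One hypothetical training triplet: input image, paraphrased instruction,
# and the task output rendered as an image (e.g., a depth map mapped to RGB).
input_image = torch.rand(1, 3, 64, 64)
target_image = torch.rand(1, 3, 64, 64)  # visually encoded task output
text_emb = embed_instruction("Estimate the depth of this image.")

model = TinyDenoiser()
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)  # toy noise schedule

# Standard epsilon-prediction diffusion loss, conditioned on image + instruction.
t = torch.randint(0, 1000, (1,))
a = alphas_cumprod[t].view(-1, 1, 1, 1)
noise = torch.randn_like(target_image)
noisy_target = a.sqrt() * target_image + (1 - a).sqrt() * noise
pred = model(noisy_target, input_image, text_emb, t.float() / 1000)
loss = F.mse_loss(pred, noise)
loss.backward()
print(f"diffusion loss: {loss.item():.4f}")

The key design point the sketch illustrates is that the conditioning image is concatenated channel-wise with the noisy target, so a single denoiser can serve any task whose output can be rendered as an image; only the natural-language instruction changes from task to task.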

[1] Jiannan Wu et al. VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks, 2023, NeurIPS.

[2] Jianfeng Gao et al. A Simple Framework for Open-Vocabulary Segmentation and Detection, 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV).

[3] Shalini De Mello et al. Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Yong Jae Lee et al. Generalized Decoding for Pixel, Image, and Language, 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5] Chunhua Shen et al. Images Speak in Images: A Generalist Painter for In-Context Visual Learning, 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Alexei A. Efros et al. InstructPix2Pix: Learning to Follow Image Editing Instructions, 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Hongsheng Li et al. Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks, 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8] P. Luo et al. DiffusionDet: Diffusion Model for Object Detection, 2022, 2023 IEEE/CVF International Conference on Computer Vision (ICCV).

[9] Yaniv Taigman et al. Make-A-Video: Text-to-Video Generation without Text-Video Data, 2022, ICLR.

[10] Alexei A. Efros et al. Visual Prompting via Image Inpainting, 2022, NeurIPS.

[11] Li Dong et al. Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, 2022, ArXiv.

[12] Jonathan Ho. Classifier-Free Diffusion Guidance, 2022, ArXiv.

[13] Aniruddha Kembhavi et al. Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks, 2022, ICLR.

[14] David J. Fleet et al. A Unified Sequence Interface for Vision Tasks, 2022, NeurIPS.

[15] Tero Karras et al. Elucidating the Design Space of Diffusion-Based Generative Models, 2022, NeurIPS.

[16] David J. Fleet et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, 2022, NeurIPS.

[17] André Susano Pinto et al. UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes, 2022, NeurIPS.

[18] Zirui Wang et al. CoCa: Contrastive Captioners are Image-Text Foundation Models, 2022, Trans. Mach. Learn. Res..

[19] Prafulla Dhariwal et al. Hierarchical Text-Conditional Image Generation with CLIP Latents, 2022, ArXiv.

[20] Junjun Jiang et al. BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation, 2022, ArXiv.

[21] Xianming Liu et al. DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation, 2022, Machine Intelligence Research.

[22] P. Battaglia et al. Transframer: Arbitrary Frame Prediction with Generative Models, 2022, Trans. Mach. Learn. Res..

[23] Jingren Zhou et al. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, 2022, ICML.

[24] B. Ommer et al. High-Resolution Image Synthesis with Latent Diffusion Models, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25] Prafulla Dhariwal et al. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, 2021, ICML.

[26] A. Schwing et al. Masked-attention Mask Transformer for Universal Image Segmentation, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27] Xizhou Zhu et al. Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28] Jong-Chul Ye et al. CLIPstyler: Image Style Transfer with a Single Text Condition, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Lu Yuan et al. Florence: A New Foundation Model for Computer Vision, 2021, ArXiv.

[30] David J. Fleet et al. Palette: Image-to-Image Diffusion Models, 2021, SIGGRAPH.

[31] David J. Fleet et al. Pix2seq: A Language Modeling Framework for Object Detection, 2021, ICLR.

[32] David J. Fleet et al. Cascaded Diffusion Models for High Fidelity Image Generation, 2021, J. Mach. Learn. Res..

[33] Prafulla Dhariwal et al. Diffusion Models Beat GANs on Image Synthesis, 2021, NeurIPS.

[34] David J. Fleet et al. Image Super-Resolution via Iterative Refinement, 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35] B. Ommer et al. Taming Transformers for High-Resolution Image Synthesis, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36] Abhishek Kumar et al. Score-Based Generative Modeling through Stochastic Differential Equations, 2020, ICLR.

[37] Pieter Abbeel et al. Denoising Diffusion Probabilistic Models, 2020, NeurIPS.

[38] Peter J. Liu et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res..

[39] Chi-Keung Tang et al. FSS-1000: A 1000-Class Dataset for Few-Shot Segmentation, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Yang Song et al. Generative Modeling by Estimating Gradients of the Data Distribution, 2019, NeurIPS.

[41] Bolei Zhou et al. Scene Parsing through ADE20K Dataset, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42] Ross B. Girshick et al. Mask R-CNN, 2017, arXiv:1703.06870.

[43] Bolei Zhou et al. Semantic Understanding of Scenes Through the ADE20K Dataset, 2016, International Journal of Computer Vision.

[44] Jian Sun et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45] Surya Ganguli et al. Deep Unsupervised Learning using Nonequilibrium Thermodynamics, 2015, ICML.

[46] Bolei Zhou et al. Learning Deep Features for Scene Recognition using Places Database, 2014, NIPS.

[47] Pietro Perona et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.

[48] Derek Hoiem et al. Indoor Segmentation and Support Inference from RGBD Images, 2012, ECCV.

[49] C. V. Jawahar et al. Cats and dogs, 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50] Fei-Fei Li et al. ImageNet: A large-scale hierarchical image database, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51] D. P. Kingma et al. Auto-Encoding Variational Bayes, 2014, ICLR.

[52] Jianguo Zhang et al. The PASCAL Visual Object Classes Challenge, 2006.