Caption Anything: Interactive Image Description with Diverse Multimodal Controls

Controllable image captioning is an emerging multimodal topic that aims to describe an image in natural language according to human intent, e.g., focusing on specified regions or writing in a particular text style. State-of-the-art methods are trained on annotated pairs of input controls and output captions. However, the scarcity of such well-annotated multimodal data largely limits their usability and scalability for interactive AI systems. Leveraging unimodal instruction-following foundation models is a promising alternative that benefits from broader sources of data. In this paper, we present Caption AnyThing (CAT), a foundation-model-augmented image captioning framework supporting a wide range of multimodal controls: 1) visual controls, including points, boxes, and trajectories; 2) language controls, such as sentiment, length, language, and factuality. Powered by the Segment Anything Model (SAM) and ChatGPT, we unify visual and language prompts into a modularized framework, enabling flexible combinations of different controls. Extensive case studies demonstrate the user-intention alignment capabilities of our framework, shedding light on effective user interaction modeling in vision-language applications. Our code is publicly available at https://github.com/ttengwang/Caption-Anything.
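
The modular pipeline described in the abstract can be read as three stages: a segmenter turns a visual prompt (point, box, or trajectory) into a region mask, a captioner describes the masked region, and an instruction-following LLM rewrites the raw caption according to the language controls. The sketch below is a minimal illustration of that flow, not the actual Caption-Anything API: `segment_region`, `caption_region`, and `refine_caption` are hypothetical placeholders standing in for SAM, a pre-trained captioner such as BLIP-2, and ChatGPT, respectively.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical language controls mirroring those listed in the abstract.
@dataclass
class LanguageControl:
    sentiment: Literal["positive", "neutral", "negative"] = "neutral"
    length: Literal["short", "long"] = "short"
    language: str = "English"
    factual_only: bool = True

def segment_region(image, visual_prompt):
    """Placeholder for the segmenter (e.g. SAM): maps a point/box/trajectory
    prompt to a binary region mask."""
    raise NotImplementedError

def caption_region(image, mask):
    """Placeholder for a pre-trained captioner (e.g. BLIP-2) applied to the
    masked region."""
    raise NotImplementedError

def refine_caption(raw_caption: str, control: LanguageControl) -> str:
    """Placeholder for an instruction-following LLM (e.g. ChatGPT) that rewrites
    the raw caption to satisfy the requested sentiment, length, and language."""
    raise NotImplementedError

def caption_anything(image, visual_prompt, control: LanguageControl) -> str:
    # 1) Visual control -> region mask (segmenter).
    mask = segment_region(image, visual_prompt)
    # 2) Region mask -> raw description of the selected region (captioner).
    raw = caption_region(image, mask)
    # 3) Language control -> user-aligned caption (text refiner).
    return refine_caption(raw, control)
```

Because each stage is an independent, frozen foundation model, controls can be mixed freely: swapping the visual prompt changes only the mask, while swapping the language control changes only the final rewriting step.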
