IC3: Image Captioning by Committee Consensus

If you ask a human to describe an image, they might do so in a thousand different ways. Image captioning models are traditionally trained to approximate the reference distribution of image captions; however, doing so encourages captions that are viewpoint-impoverished, focusing on only a subset of the possible details while ignoring potentially useful information in the scene. In this work, we introduce a simple yet novel method, "Image Captioning by Committee Consensus" ($IC^3$), designed to generate a single caption that captures high-level details from several viewpoints. Notably, humans rate captions produced by $IC^3$ at least as helpful as those from baseline state-of-the-art (SOTA) models more than two-thirds of the time, and $IC^3$ captions can improve the performance of SOTA automated recall systems by up to 84%, indicating material improvements over existing approaches to visual description. Our code is publicly available at https://github.com/DavidMChan/caption-by-committee
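
At a high level, $IC^3$ proceeds in two stages: it first samples a "committee" of candidate captions from an off-the-shelf captioning model, then asks a large language model to summarize them into a single consensus caption. The sketch below illustrates this two-stage idea; the specific model (a BLIP checkpoint loaded via HuggingFace transformers), the sampling parameters, and the prompt wording are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the IC3 idea, under assumed model and prompt choices:
# (1) sample several diverse captions, (2) summarize them with an LLM.

from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

def sample_committee(image: Image.Image, n: int = 10, temperature: float = 1.0) -> list[str]:
    """Draw n candidate captions via temperature sampling, so each sample
    can surface a different 'viewpoint' on the image."""
    inputs = processor(images=image, return_tensors="pt")
    outputs = captioner.generate(
        **inputs,
        do_sample=True,            # stochastic decoding -> diverse candidates
        temperature=temperature,
        num_return_sequences=n,
        max_new_tokens=40,
    )
    return [processor.decode(o, skip_special_tokens=True) for o in outputs]

def consensus_prompt(captions: list[str]) -> str:
    """Build a summarization prompt over the committee's captions; the
    wording here is a hypothetical stand-in for the paper's prompt."""
    listing = "\n".join(f"- {c}" for c in captions)
    return (
        "The following captions each describe the same image:\n"
        f"{listing}\n"
        "Write a single caption that combines the details above, "
        "hedging anything the captions disagree on:"
    )

# The prompt would then be sent to a summarization-capable LLM (e.g., via
# an API client); that call is omitted here as it depends on the provider.
```

Sampling with a nonzero temperature, rather than beam search, is what gives the committee its diversity: each draw can commit to a different subset of scene details, and the summarization step is then responsible for reconciling them.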
