Are metrics measuring what they should? An evaluation of image captioning task metrics

Image Captioning is an active research task that aims to describe the content of an image in terms of the objects it contains and their relationships. Two important research areas converge on this task: computer vision and natural language processing. In Image Captioning, as in any computational intelligence task, performance metrics are crucial for knowing how well (or poorly) a method performs. In recent years, it has been observed that classical metrics based on n-grams are insufficient to capture the semantics and the critical meaning needed to describe the content of an image. To measure how well the current and more recent metrics are doing, in this article we present an evaluation of several kinds of Image Captioning metrics and a comparison among them using the well-known MS COCO dataset. The metrics were selected from those most used in prior work: metrics based on n-grams, namely BLEU, SacreBLEU, METEOR, ROUGE-L, CIDEr, and SPICE, and metrics based on embeddings, namely BERTScore and CLIPScore. We designed two evaluation scenarios: 1) a set of artificially built captions of varying quality, and 2) a comparison of several state-of-the-art Image Captioning methods. We report interesting findings while trying to answer the following questions: Are the current metrics helping to produce high-quality captions? How do the current metrics compare to each other? What are the metrics really measuring?
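To make the comparison concrete, the following Python sketch scores a candidate caption against multiple MS COCO-style references with one n-gram metric (BLEU, via the sacrebleu package) and one embedding-based metric (BERTScore, via the bert-score package). The package choices and the toy captions are illustrative assumptions, not the exact tooling used in this evaluation.

# Illustrative sketch (assumed tooling: sacrebleu and bert-score; not
# necessarily the packages used in this evaluation). It scores one
# candidate caption against several MS COCO-style human references.
import sacrebleu
from bert_score import score as bert_score

candidates = ["a man riding a wave on a surfboard"]        # system outputs
references = [                                             # per-image human references
    ["a surfer rides a large wave", "a man surfing in the ocean"],
]

# sacrebleu expects one reference stream per reference slot:
# refs_t[k][i] holds the k-th reference for the i-th candidate.
refs_t = [list(r) for r in zip(*references)]
bleu = sacrebleu.corpus_bleu(candidates, refs_t)
print(f"BLEU: {bleu.score:.2f}")

# BERTScore compares contextual embeddings and keeps, for each candidate,
# the best-matching reference; multi-reference input is a list of lists.
P, R, F1 = bert_score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.4f}")

An analogous loop over the generated and artificially built captions, with METEOR, ROUGE-L, CIDEr, SPICE, and CLIPScore scorers added, would reproduce the kind of metric comparison described above.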
