Pre-gen metrics: Predicting caption quality metrics without generating captions

Image caption generation systems are typically evaluated against reference outputs. We show that output quality can be predicted without generating any captions, using the probability the neural model assigns to the reference captions. Such pre-gen metrics correlate strongly with standard evaluation metrics.
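A minimal sketch of the idea, assuming a captioning model that exposes per-token log-probabilities of a given caption under teacher forcing; the `model.token_log_probs` interface and both helper names are illustrative assumptions, not the paper's implementation:

```python
from scipy.stats import spearmanr

def pregen_score(model, image, reference_captions):
    """Mean length-normalised log-probability of the references under the model.

    `model.token_log_probs(image, caption)` is a hypothetical interface
    standing in for whatever neural captioner is being evaluated: it
    returns the per-token log-probabilities of `caption` conditioned on
    `image` (teacher forcing). No caption is generated at any point.
    """
    sentence_scores = [
        sum(lps) / len(lps)  # length-normalised sentence log-probability
        for lps in (model.token_log_probs(image, c) for c in reference_captions)
    ]
    return sum(sentence_scores) / len(sentence_scores)

def correlate_with_metric(pregen_scores, metric_scores):
    """Spearman rank correlation between pre-gen scores and a standard
    post-generation metric (e.g. BLEU or CIDEr) over a validation set."""
    rho, p_value = spearmanr(pregen_scores, metric_scores)
    return rho, p_value
```

Length normalisation keeps sentence scores comparable across captions of different lengths, and the rank-correlation check mirrors how a pre-gen metric would be validated against standard post-generation metrics.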
