Measuring the Diversity of Automatic Image Descriptions

Automatic image description systems typically produce generic sentences that only make use of a small subset of the vocabulary available to them. In this paper, we consider the production of generic descriptions as a lack of diversity in the output, which we quantify using established metrics and two new metrics that frame image description as a word recall task. This framing allows us to evaluate system performance on the head of the vocabulary, as well as on the long tail, where system performance degrades. We use these metrics to examine the diversity of the sentences generated by nine state-of-the-art systems on the MS COCO data set. We find that the systems trained with maximum likelihood objectives produce less diverse output than those trained with additional adversarial objectives. However, the adversarially-trained models only produce more types from the head of the vocabulary and not the tail. Besides vocabulary-based methods, we also look at the compositional capacity of the systems, specifically their ability to create compound nouns and prepositional phrases of different lengths. We conclude that there is still much room for improvement, and offer a toolkit to measure progress towards the goal of generating more diverse image descriptions.
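To make the word-recall framing concrete, below is a minimal sketch of how such vocabulary-based diversity measures could be computed. The function names, the whitespace tokenisation, and the head/tail split at a fixed frequency rank are illustrative assumptions, not the paper's actual toolkit.

```python
"""Sketch of vocabulary-based diversity metrics for image descriptions.

Assumptions (not from the paper): simple lowercased whitespace tokenisation,
and a head/tail split at the `head_size` most frequent reference types.
"""
from collections import Counter


def type_token_ratio(descriptions):
    """Distinct word types divided by total tokens over a set of descriptions."""
    tokens = [tok for desc in descriptions for tok in desc.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def global_recall(system_descriptions, reference_descriptions, head_size=1000):
    """Fraction of the reference vocabulary recalled by the system output,
    reported separately for the head (most frequent types) and the tail."""
    ref_counts = Counter(
        tok for desc in reference_descriptions for tok in desc.lower().split()
    )
    ranked = [word for word, _ in ref_counts.most_common()]
    head, tail = set(ranked[:head_size]), set(ranked[head_size:])

    sys_vocab = {tok for desc in system_descriptions for tok in desc.lower().split()}
    head_recall = len(sys_vocab & head) / len(head) if head else 0.0
    tail_recall = len(sys_vocab & tail) / len(tail) if tail else 0.0
    return head_recall, tail_recall
```

Under these assumptions, comparing `head_recall` and `tail_recall` across systems would show whether a model's extra vocabulary comes from the head of the distribution or whether it also reaches the long tail, which is the contrast the paper draws between maximum-likelihood and adversarially-trained models.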
