Learning to Rank Visual Stories From Human Ranking Data

Visual storytelling (VIST) is a typical vision-and-language task that has seen extensive development in natural language generation research. However, it remains unclear whether conventional automatic evaluation metrics for text generation are applicable to VIST. In this paper, we present the VHED (VIST Human Evaluation Data) dataset, which is the first to re-purpose human evaluation results for automatic evaluation, and we use it to develop Vrank (VIST Ranker), a novel reference-free VIST metric for story evaluation. We first show that scores from commonly adopted automatic metrics for text generation correlate poorly with human evaluation, which motivates us to train the automatic evaluation model directly on human evaluation results. In our experiments, we use Vrank as well as other reference-based and reference-free metrics to predict the ranks of generated stories. Results show that Vrank's predictions align significantly better with human evaluation than those of other metrics, achieving almost 30% higher accuracy when ranking story pairs. Moreover, we demonstrate that only Vrank shows human-like behavior: it is far better at identifying the superior story when the quality gap between two stories is large. Finally, we show the superiority of Vrank through its generalizability to purely textual stories, and conclude that this reuse of human evaluation results puts Vrank in a strong position for continued future advances.
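
The abstract does not specify Vrank's architecture or training objective, only that it is a reference-free model learned from human ranking data and evaluated by pairwise ranking accuracy. The sketch below is therefore an illustrative assumption, not the paper's method: it trains a standard pairwise (Bradley-Terry style) ranker on human-judged story pairs using a BERT text encoder, and ignores the image sequence that the real Vrank may also condition on. The names `PairwiseStoryRanker`, `pairwise_loss`, and the `bert-base-uncased` checkpoint are all hypothetical choices.

```python
# Minimal sketch of a pairwise story ranker in the spirit of Vrank.
# All design choices here (encoder, loss, names) are assumptions;
# the paper's actual model may differ and may also use the images.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class PairwiseStoryRanker(nn.Module):
    """Scores a story with a BERT encoder plus a linear head."""
    def __init__(self, encoder_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # Use the [CLS] token representation as the story embedding.
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.head(hidden[:, 0]).squeeze(-1)  # one scalar per story

def pairwise_loss(score_a, score_b, a_preferred):
    # Bradley-Terry style objective: the human-preferred story of each
    # pair should receive the higher score.
    return F.binary_cross_entropy_with_logits(score_a - score_b,
                                              a_preferred.float())

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = PairwiseStoryRanker()

# One toy batch of human-ranked story pairs (hypothetical data).
story_a = ["we went to the beach . the kids built a huge sandcastle ."]
story_b = ["a picture of a beach . a picture of a sandcastle ."]
a_preferred = torch.tensor([1])  # annotators preferred story A

enc_a = tokenizer(story_a, return_tensors="pt", truncation=True, padding=True)
enc_b = tokenizer(story_b, return_tensors="pt", truncation=True, padding=True)
loss = pairwise_loss(model(enc_a["input_ids"], enc_a["attention_mask"]),
                     model(enc_b["input_ids"], enc_b["attention_mask"]),
                     a_preferred)
loss.backward()  # one optimization step would follow in a real loop

# The abstract's headline number is pairwise ranking accuracy: the
# fraction of pairs where the metric ranks the preferred story higher.
with torch.no_grad():
    s_a = model(enc_a["input_ids"], enc_a["attention_mask"])
    s_b = model(enc_b["input_ids"], enc_b["attention_mask"])
    accuracy = ((s_a > s_b) == a_preferred.bool()).float().mean().item()
```

Under this framing, any metric (reference-based or reference-free) can be compared on the same footing: score both stories in each human-ranked pair and count how often the metric agrees with the human preference, which is how the abstract's "almost 30% higher accuracy" claim is measured.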
