GO FIGURE: A Meta Evaluation of Factuality in Summarization

While neural language models can generate text with remarkable fluency and coherence, controlling for factual correctness in generation remains an open research question. This major discrepancy between the surface-level fluency and the content-level correctness of neural generation has motivated a new line of research that seeks automatic metrics for evaluating the factuality of machine text. In this paper, we introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics. We propose five necessary and intuitive conditions to evaluate factuality metrics on diagnostic factuality data across three different summarization tasks. Our benchmark analysis on ten factuality metrics reveals that our meta-evaluation framework provides a robust and efficient evaluation that is extensible to multiple types of factual consistency and standard generation metrics, including QA metrics. It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
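To make the meta-evaluation idea concrete, the sketch below shows one way a sensitivity-style check could be implemented: a candidate factuality metric is scored on diagnostic summaries containing an increasing number of injected factual errors, and the check asks whether the metric's scores fall accordingly. This is an illustrative sketch, not the paper's exact procedure; the `metric(document, summary)` interface and the nested `diagnostic_summaries` structure are assumptions made for the example.

```python
# Illustrative sensitivity-style meta-evaluation check (a sketch, not the
# paper's exact method). Assumes a candidate factuality metric exposed as
# metric(document, summary) -> float (hypothetical signature) and diagnostic
# summaries with 0..K injected factual errors per document.

from statistics import mean
from scipy.stats import spearmanr  # rank correlation between error level and score


def sensitivity_check(metric, documents, diagnostic_summaries):
    """diagnostic_summaries[i][k] is a summary of documents[i] with k injected errors."""
    error_levels, scores, level_means = [], [], []
    num_levels = len(diagnostic_summaries[0])

    for k in range(num_levels):
        # Score every document's level-k diagnostic summary with the candidate metric.
        level_scores = [
            metric(doc, summaries[k])
            for doc, summaries in zip(documents, diagnostic_summaries)
        ]
        level_means.append(mean(level_scores))
        error_levels.extend([k] * len(level_scores))
        scores.extend(level_scores)

    # Check 1 (sketch): the average score should not rise as more errors are injected.
    monotone = all(a >= b for a, b in zip(level_means, level_means[1:]))
    # Check 2 (sketch): scores should correlate negatively with the error level.
    rho, _ = spearmanr(error_levels, scores)
    return {"monotone_decrease": monotone, "spearman_rho": rho}
```

A metric that passes such a check assigns lower scores to summaries as factual errors accumulate; a metric that fails is insensitive to the very corruptions it is supposed to detect, regardless of how well it tracks fluency.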
