MQAG: Multiple-choice Question Answering and Generation for Assessing Information Consistency in Summarization

State-of-the-art summarization systems can generate highly fluent summaries. These summaries, however, may contain factual inconsistencies and/or information not present in the source. Hence, an important component of assessing summary quality is determining whether the summary is informationally consistent with the source. Existing approaches are typically based on lexical matching or representation-based methods. In this work, we introduce an alternative scheme based on standard information-theoretic measures in which the information present in the source and the summary is directly compared. We propose a Multiple-choice Question Answering and Generation framework, MQAG, which approximates information consistency by computing the expected KL-divergence between summary-conditioned and source-conditioned answer distributions over automatically generated multiple-choice questions. This approach exploits multiple-choice answer probabilities, since predicted answer distributions can be compared directly. We conduct experiments on four summary evaluation datasets: QAG-CNNDM/XSum, XSum-Faithfulness, Podcast Assessment, and SummEval. The results show that MQAG (using models trained on RACE) outperforms existing evaluation methods on the majority of tasks.
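To make the scoring step concrete, the sketch below computes an MQAG-style consistency score as the expected KL-divergence between answer distributions conditioned on the summary and on the source, averaged over generated multiple-choice questions. This is a minimal illustration of the aggregation arithmetic only: `generate_questions` and `answer_dist` are hypothetical placeholders standing in for the trained question-generation and multiple-choice answering models (e.g., models trained on RACE), and are not part of any released API.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two categorical (multiple-choice) answer distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def mqag_score(source, summary, generate_questions, answer_dist, num_questions=20):
    """
    Approximate information consistency as the expected KL-divergence between
    summary-conditioned and source-conditioned answer distributions.

    generate_questions(text, n) -> list of (question, options) generated from the summary
    answer_dist(context, question, options) -> probability vector over the options
    Both callables are assumed stand-ins for the trained generation/answering models.
    """
    questions = generate_questions(summary, num_questions)
    divergences = []
    for question, options in questions:
        p_summary = answer_dist(summary, question, options)  # answer using the summary
        p_source = answer_dist(source, question, options)    # answer using the source
        divergences.append(kl_divergence(p_summary, p_source))
    # Lower expected divergence indicates higher source-summary consistency.
    return float(np.mean(divergences))
```

Under this sketch, a summary whose questions are answered the same way from the source as from the summary itself receives a score near zero, while hallucinated content pushes the source-conditioned answer distribution away from the summary-conditioned one and increases the score.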

[1] M. Gales, et al. Podcast Summary Assessment: A Resource for Evaluating Summary Assessment Methods, 2022, arXiv.

[2] D. Roth, et al. Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics, 2022, Findings of ACL.

[3] Pascale Fung, et al. Survey of Hallucination in Natural Language Generation, 2022, ACM Computing Surveys.

[4] Richard Yuanzhe Pang, et al. QuALITY: Question Answering with Long Input Texts, Yes!, 2021, NAACL.

[5] Yinfei Yang, et al. SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling, 2020, NAACL.

[6] Ramesh Nallapati, et al. Improving Factual Consistency of Abstractive Summarization via Question Answering, 2021, ACL.

[7] Bing Qin, et al. The Factual Inconsistency Problem in Abstractive Text Summarization: A Survey, 2021, arXiv.

[8] D. Roth, et al. Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary, 2020, Transactions of the Association for Computational Linguistics.

[9] Dragomir R. Radev, et al. SummEval: Re-evaluating Summarization Evaluation, 2020, Transactions of the Association for Computational Linguistics.

[10] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[11] Mona T. Diab, et al. FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization, 2020, ACL.

[12] Ryan McDonald, et al. On Faithfulness and Factuality in Abstractive Summarization, 2020, ACL.

[13] Arman Cohan, et al. Longformer: The Long-Document Transformer, 2020, arXiv.

[14] Alex Wang, et al. Asking and Answering Questions to Evaluate the Factual Consistency of Summaries, 2020, ACL.

[15] Richard Socher, et al. Evaluating the Factual Consistency of Abstractive Text Summarization, 2019, EMNLP.

[16] Richard Socher, et al. Neural Text Summarization: A Critical Evaluation, 2019, EMNLP.

[17] Ben Goodrich, et al. Assessing The Factual Accuracy of Generated Text, 2019, KDD.

[18] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[19] Guokun Lai, et al. RACE: Large-scale ReAding Comprehension Dataset From Examinations, 2017, EMNLP.

[20] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2020, JMLR.

[21] Oren Etzioni, et al. Open Information Extraction from the Web, 2007, CACM.

[22] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL Workshop on Text Summarization Branches Out.

[23] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.