Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary

Abstract

A desirable property of a reference-based evaluation metric that measures the content quality of a summary is that it should estimate how much information the summary has in common with a reference. Traditional text-overlap metrics such as ROUGE fail to achieve this because they are limited to matching tokens, either lexically or via embeddings. In this work, we propose a metric that evaluates the content quality of a summary using question answering (QA). QA-based methods directly measure a summary's information overlap with a reference, making them fundamentally different from text-overlap metrics. We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval. QAEval outperforms current state-of-the-art metrics on most evaluations using benchmark datasets, while being competitive on others due to limitations of state-of-the-art models. Through a careful analysis of each component of QAEval, we identify its performance bottlenecks and estimate that its potential upper-bound performance surpasses all other automatic metrics, approaching that of the gold-standard Pyramid Method.

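The abstract describes the QA-based approach only at a high level. As a rough illustration of that idea, the sketch below scores a candidate summary by answering questions derived from the reference summary against the candidate and averaging a SQuAD-style token F1 over the answers. It assumes the question-answer pairs have already been generated from the reference; the QA model name and the token-F1 scoring are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a QA-based content-quality metric in the spirit of QAEval.
# Assumes (question, gold_answer) pairs were already generated from the
# reference summary; the model below is an illustrative choice.
from collections import Counter
from transformers import pipeline

# Extractive QA model used to answer reference questions against the candidate
# summary (an assumption, not necessarily the model used in the paper).
qa_model = pipeline("question-answering", model="deepset/roberta-base-squad2")

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-level F1 between a predicted and a gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def qa_content_score(candidate_summary: str, reference_qa_pairs) -> float:
    """Average answer F1 over questions derived from the reference summary."""
    scores = []
    for question, gold_answer in reference_qa_pairs:
        result = qa_model(question=question, context=candidate_summary)
        scores.append(token_f1(result["answer"], gold_answer))
    return sum(scores) / len(scores) if scores else 0.0
```

A candidate that contains the same information as the reference will allow most questions to be answered correctly and receive a high score, whereas a fluent but uninformative candidate will not, which is the behavior a content-quality metric is meant to capture.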