Fill in the BLANC: Human-free quality estimation of document summaries

We present BLANC, a new approach to the automatic estimation of document summary quality. Our goal is to measure the functional performance of a summary with an objective, reproducible, and fully automated method. Our approach achieves this by measuring the performance boost a pre-trained language model gains from access to a document summary while carrying out its language understanding task on the document's text. We present evidence that BLANC scores correlate with human evaluations as well as the ROUGE family of summary quality measures does. Unlike ROUGE, the BLANC method requires no human-written reference summaries, allowing fully human-free summary quality estimation.
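The "performance boost" idea can be made concrete with a small sketch. Assume the underlying understanding task is masked-token reconstruction over the document's sentences, run twice: once with the summary prepended and once with a neutral filler of the same length. The function name `blanc_score` and the toy counts below are illustrative assumptions, not the paper's exact formulation, which defines specific masking schemes and variants.

```python
def blanc_score(help_successes: int, base_successes: int, total: int) -> float:
    """Difference in masked-token reconstruction accuracy when the
    model sees the summary (help) versus a neutral filler (base).

    A positive score means the summary helped the model understand
    the document; zero or negative means it added no information.
    """
    if total <= 0:
        raise ValueError("need at least one masking attempt")
    return (help_successes - base_successes) / total

# Hypothetical counts: out of 200 masked tokens, the model reconstructs
# 120 correctly with the summary prepended and 100 with the filler.
print(blanc_score(120, 100, 200))  # 0.1
```

In practice the two reconstruction passes would be run with a pre-trained masked language model (e.g. BERT), masking tokens in each document sentence and comparing the model's predictions against the original tokens; the scalar above is just the final aggregation step.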
