SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization

We study unsupervised multi-document summarization evaluation metrics, which require neither human-written reference summaries nor human annotations (e.g., preferences or ratings). We propose SUPERT, which rates the quality of a summary by measuring its semantic similarity to a pseudo reference summary, i.e., salient sentences selected from the source documents, using contextualized embeddings and soft token alignment techniques. Compared with state-of-the-art unsupervised evaluation metrics, SUPERT correlates with human ratings 18-39% better. Furthermore, we use SUPERT as a reward to guide a neural reinforcement-learning-based summarizer, yielding favorable performance compared with state-of-the-art unsupervised summarizers. All source code is available at this https URL.
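To make the scoring idea concrete, below is a minimal sketch of a SUPERT-style metric, not the authors' released implementation. It builds a pseudo reference from the leading sentences of each source document and scores the summary by greedy soft alignment of embeddings, combined into an F1-style score. The model name `all-MiniLM-L6-v2`, the first-N-sentences heuristic, and the sentence-level (rather than token-level) alignment are all simplifying assumptions introduced here for illustration.

```python
"""A minimal, illustrative sketch of a SUPERT-style unsupervised metric.

Assumptions (not from the paper): the embedding model name, the
first-N-sentences pseudo-reference heuristic, and sentence-level
greedy alignment in place of the paper's soft token alignment.
"""
import numpy as np
from sentence_transformers import SentenceTransformer


def build_pseudo_reference(documents, n_sents=10):
    """Heuristic pseudo reference: the first n_sents sentences of each
    source document (a crude stand-in for salient-sentence selection)."""
    ref = []
    for doc in documents:
        # Naive sentence splitting; a real system would use a proper splitter.
        sents = [s.strip() for s in doc.split(".") if s.strip()]
        ref.extend(sents[:n_sents])
    return ref


def supert_like_score(summary_sents, ref_sents, model):
    """Greedily soft-align summary and pseudo-reference sentence
    embeddings, then combine precision and recall into an F1 score."""
    s = model.encode(summary_sents)  # (m, d) summary sentence embeddings
    r = model.encode(ref_sents)      # (n, d) pseudo-reference embeddings
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    r = r / np.linalg.norm(r, axis=1, keepdims=True)
    sim = s @ r.T                    # (m, n) cosine similarity matrix
    recall = sim.max(axis=0).mean()     # each ref sentence -> best summary match
    precision = sim.max(axis=1).mean()  # each summary sentence -> best ref match
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    docs = [
        "The storm hit the coast on Monday. Thousands lost power. Repairs began Tuesday.",
        "Officials said power would return within days. The storm was the worst in a decade.",
    ]
    summary = ["A severe storm cut power to thousands", "officials expect repairs within days"]
    model = SentenceTransformer("all-MiniLM-L6-v2")
    ref = build_pseudo_reference(docs)
    print(f"SUPERT-like score: {supert_like_score(summary, ref, model):.3f}")
```

In the paper itself, the alignment is done at the token level with contextualized embeddings (in the spirit of BERTScore and MoverScore), and the pseudo reference is built from selected salient sentences rather than a fixed document prefix.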
