Calibrating Likelihoods towards Consistency in Summarization Models

Despite recent advances in abstractive text summarization, current summarization models still generate factually inconsistent summaries, reducing their utility for real-world applications. We argue that the main reason for this behavior is that summarization models trained with the maximum likelihood objective assign high probability to plausible sequences given the context, but often do not accurately rank those sequences by their consistency. In this work, we address this problem by calibrating the likelihood of model-generated sequences to better align with a consistency metric measured by natural language inference (NLI) models. A human evaluation study and automatic metrics show that the calibrated models generate more consistent and higher-quality summaries. We also show that models trained with our method return probabilities that are better aligned with NLI scores, which significantly increases the reliability of summarization models.
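
The calibration idea described above can be framed as a sequence-level ranking objective: among candidate summaries, the model's likelihood ordering should agree with the NLI consistency ordering. A minimal sketch in plain Python, assuming a pairwise margin formulation over candidate (log-likelihood, NLI-score) pairs; the function name, margin value, and hinge form are illustrative assumptions, not the paper's exact loss:

```python
from itertools import combinations

def calibration_rank_loss(log_likelihoods, nli_scores, margin=0.1):
    """Pairwise margin ranking loss (illustrative): for every pair of
    candidate summaries, the one with the higher NLI consistency score
    should also receive the higher model log-likelihood; order
    violations incur a hinge penalty."""
    loss, pairs = 0.0, 0
    for i, j in combinations(range(len(nli_scores)), 2):
        if nli_scores[i] == nli_scores[j]:
            continue  # no preference between equally consistent candidates
        # Order the pair so `hi` is the more NLI-consistent candidate.
        hi, lo = (i, j) if nli_scores[i] > nli_scores[j] else (j, i)
        # Penalize unless the preferred candidate's log-likelihood
        # exceeds the dispreferred one's by at least `margin`.
        loss += max(0.0, margin - (log_likelihoods[hi] - log_likelihoods[lo]))
        pairs += 1
    return loss / pairs if pairs else 0.0
```

In training, such a term would be minimized alongside (or after) standard maximum-likelihood fine-tuning, pushing the model's sequence probabilities toward the NLI ranking rather than only toward the reference summary.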
