WeCheck: Strong Factual Consistency Checker via Weakly Supervised Learning

A crucial issue with current text generation models is that they often uncontrollably generate text that is factually inconsistent with their inputs. Due to the lack of annotated data, existing factual consistency metrics usually train evaluation models on synthetic texts or transfer directly from related tasks, such as question answering (QA) and natural language inference (NLI). Bias in the synthetic text or the upstream tasks makes them perform poorly on text actually generated by language models, especially for general-purpose evaluation across diverse tasks. To alleviate this problem, we propose a weakly supervised framework named WeCheck that is trained directly on samples actually generated by language models, using weakly annotated labels. WeCheck first utilizes a generative model to infer the factual labels of generated samples by aggregating weak labels from multiple resources. Next, we train a simple noise-aware classification model as the target metric using the inferred weakly supervised information. Comprehensive experiments on various tasks demonstrate the strong performance of WeCheck, achieving an average absolute improvement of 3.3% on the TRUE benchmark over 11B-parameter state-of-the-art methods while using only 435M parameters. Furthermore, it is up to 30 times faster than previous evaluation methods, greatly improving both the accuracy and the efficiency of factual consistency evaluation.
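To make the two-step pipeline concrete, below is a minimal, self-contained sketch (not the authors' code) of the general idea: aggregate weak factuality labels from several existing metrics into soft labels, then train a noise-aware binary classifier on those soft labels. The source names, the assumed per-source accuracies, and the toy logistic-regression "metric" are illustrative assumptions; WeCheck itself uses a generative label model and a pretrained language-model classifier.

```python
# Illustrative sketch only: weak-label aggregation + noise-aware training.
# `weak_scores` and `source_accuracy` are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)

# Step 1: weak label aggregation.
# Suppose three weak sources (e.g., QA-, NLI-, and fact-verification-based
# metrics) each assign a probability that a generated sample is factually
# consistent. Combine them with a naive-Bayes-style label model whose
# assumed per-source accuracies act as weights.
weak_scores = rng.uniform(size=(1000, 3))          # probabilities from 3 weak metrics
source_accuracy = np.array([0.75, 0.70, 0.65])     # assumed reliability of each source

def aggregate(scores, acc):
    """Combine per-source probabilities into one soft label per sample."""
    votes = (scores > 0.5).astype(float)            # hard vote from each source
    w = np.log(acc / (1.0 - acc))                   # log-odds weight per source
    logit = (2.0 * votes - 1.0) @ w                 # weighted vote in log-odds space
    return 1.0 / (1.0 + np.exp(-logit))             # soft label in [0, 1]

soft_labels = aggregate(weak_scores, source_accuracy)

# Step 2: noise-aware training.
# Train a toy logistic-regression classifier on the soft labels; keeping the
# full probability (rather than a thresholded 0/1 label) preserves the
# uncertainty of the weak supervision in the loss.
features = weak_scores                              # stand-in for real text features
w, b = np.zeros(features.shape[1]), 0.0
lr = 0.1
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    grad = p - soft_labels                          # gradient of soft-label cross-entropy
    w -= lr * features.T @ grad / len(p)
    b -= lr * grad.mean()

print("learned weights:", w, "bias:", b)
```

In the actual framework, the classifier would be a pretrained encoder (the paper reports a 435M-parameter model) fine-tuned on the aggregated labels; the sketch above only shows how soft weak labels can flow into a noise-aware loss.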
