Measuring Attribution in Natural Language Generation Models

Abstract

Large neural models have brought a new challenge to natural language generation (NLG): it has become imperative to ensure the safety and reliability of the output of models that generate freely. To this end, we present an evaluation framework, Attributable to Identified Sources (AIS), stipulating that NLG output pertaining to the external world is to be verified against an independent, provided source. We define AIS and a two-stage annotation pipeline that allows annotators to evaluate model output according to annotation guidelines. We successfully validate this approach on generation datasets spanning three tasks (two conversational QA datasets, a summarization dataset, and a table-to-text dataset). We provide full annotation guidelines in the appendices and publicly release the annotated data at https://github.com/google-research-datasets/AIS.
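
To make the two-stage pipeline concrete, below is a minimal, hypothetical Python sketch of how per-output annotator judgments (stage 1: is the output interpretable on its own? stage 2: is all of its information about the external world attributable to the provided source?) might be aggregated into a dataset-level AIS score. The names `AISAnnotation` and `ais_score` are illustrative assumptions, not part of the released framework or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AISAnnotation:
    """One annotator judgment for a single model output.

    Stage 1 asks whether the output is interpretable standalone;
    stage 2 (only meaningful if stage 1 passes) asks whether all of the
    output's claims about the external world are attributable to the
    provided source.
    """
    interpretable: bool   # stage 1 judgment
    attributable: bool    # stage 2 judgment (ignored if not interpretable)


def ais_score(annotations: List[AISAnnotation]) -> float:
    """Fraction of outputs judged both interpretable and attributable."""
    if not annotations:
        return 0.0
    positive = sum(1 for a in annotations if a.interpretable and a.attributable)
    return positive / len(annotations)


# Example: three annotated outputs, two of which pass both stages.
print(ais_score([
    AISAnnotation(interpretable=True, attributable=True),
    AISAnnotation(interpretable=True, attributable=False),
    AISAnnotation(interpretable=True, attributable=True),
]))  # -> 0.666...
```

In this sketch an output only counts toward the score when it passes both stages, mirroring the pipeline's design in which attribution is judged only for outputs that are first found interpretable.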
