Optimizing the Factual Correctness of a Summary: A Study of Summarizing Radiology Reports

Neural abstractive summarization models can generate summaries with high overlap with human references. However, existing models are not optimized for factual correctness, a critical metric in real-world applications. In this work, we develop a general framework in which the factual correctness of a generated summary is evaluated by fact-checking it automatically against its reference using an information extraction module. We further propose a training strategy that optimizes a neural summarization model with a factual correctness reward via reinforcement learning. We apply the proposed method to the summarization of radiology reports, where factual correctness is a key requirement. On two separate datasets collected from hospitals, we show via both automatic and human evaluation that the proposed approach substantially improves the factual correctness and overall quality of outputs over a competitive neural summarization system, producing radiology summaries that approach the quality of human-authored ones.
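
To make the abstract's two components concrete, the sketch below illustrates (a) a factual F1 score computed by comparing fact variables extracted from the generated and reference summaries, and (b) a self-critical policy-gradient loss that uses this score as part of the reward. This is a minimal sketch, not the authors' released code: `extract_facts` stands in for a CheXpert-style information extraction module, `rouge` for any summary-overlap scorer, and the weighting `lam` is illustrative.

```python
# A minimal sketch of the factual-correctness reward and RL objective.
# Assumptions: `extract_facts` is a hypothetical stand-in for a
# CheXpert-style IE module mapping a summary to binary fact variables;
# `rouge` is any overlap scorer; names and `lam` are illustrative.
from typing import Callable, Dict

def factual_f1(pred_facts: Dict[str, int], ref_facts: Dict[str, int]) -> float:
    """F1 agreement between fact variables extracted from the generated
    summary and those extracted from the human reference."""
    pred_pos = {k for k, v in pred_facts.items() if v == 1}
    ref_pos = {k for k, v in ref_facts.items() if v == 1}
    if not pred_pos and not ref_pos:
        return 1.0  # both summaries assert no positive findings
    tp = len(pred_pos & ref_pos)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred_pos), tp / len(ref_pos)
    return 2 * precision * recall / (precision + recall)

def reward(summary: str, reference: str,
           extract_facts: Callable[[str], Dict[str, int]],
           rouge: Callable[[str, str], float],
           lam: float = 0.5) -> float:
    """Combined reward: lexical overlap (ROUGE) plus factual F1."""
    r_fact = factual_f1(extract_facts(summary), extract_facts(reference))
    return (1.0 - lam) * rouge(summary, reference) + lam * r_fact

def self_critical_loss(sampled: str, greedy: str, reference: str,
                       log_prob_sampled: float,
                       extract_facts: Callable[[str], Dict[str, int]],
                       rouge: Callable[[str, str], float]) -> float:
    """Self-critical policy gradient (Rennie et al., 2017): the reward of
    the greedy decode serves as the baseline for the sampled summary."""
    advantage = (reward(sampled, reference, extract_facts, rouge)
                 - reward(greedy, reference, extract_facts, rouge))
    return -advantage * log_prob_sampled
```

In practice, such a policy-gradient term is usually mixed with the standard teacher-forcing likelihood loss to keep training stable.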
