An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters

Medical doctors spend on average 52 to 102 minutes per day writing clinical notes from their patient encounters (Hripcsak et al., 2011). Reducing this workload calls for relevant and efficient summarization methods. In this paper, we introduce new resources and empirical investigations for the automatic summarization of doctor-patient conversations in a clinical setting. In particular, we introduce the MTS-Dialog dataset, a new collection of 1,700 doctor-patient dialogues and corresponding clinical notes. We use this new dataset to investigate the feasibility of this task and the relevance of existing language models, data augmentation, and guided summarization techniques. We compare standard evaluation metrics based on n-gram matching, contextual embeddings, and fact extraction to assess the accuracy and the factual consistency of the generated summaries. To ground these results, we perform an expert-based evaluation using relevant natural language generation criteria and task-specific criteria such as critical omissions, and we study the correlation between the automatic metrics and expert judgments. To the best of our knowledge, this study is the first attempt to introduce an open dataset of doctor-patient conversations and clinical notes, with detailed automated and manual evaluations of clinical note generation.

[1] Alex Papadopoulos Korfiatis, et al. User-Driven Research of Medical Note Generation Software, 2022, NAACL-HLT.

[2] Alex Papadopoulos Korfiatis, et al. Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation, 2022, ACL.

[3] Alex Papadopoulos Korfiatis, et al. PriMock57: A Dataset Of Primary Care Mock Consultations, 2022, ACL.

[4] Andrea Madotto, et al. Dialogue Summaries as Dialogue States (DS2), Template-Guided Summarization for Few-shot Dialogue State Tracking, 2022, FINDINGS.

[5] Thomas Lin, et al. MedicalSum: A Guided Clinical Abstractive Summarization Model for Generating Medical Reports from Patient-Doctor Conversations, 2022, EMNLP.

[6] Vitalii Zhelezniak, et al. Towards more patient friendly clinical notes through language models and ontologies, 2021, AMIA.

[7] Thomas Schaaf, et al. Leveraging Pretrained Models for Automatic Summarization of Doctor-Patient Conversations, 2021, EMNLP.

[8] Wen-wai Yim, et al. Towards Automating Medical Scribing: Clinic Visit Dialogue2Note Sentence Alignment and Snippet Summarization, 2021, NLPMC.

[9] Dragomir R. Radev, et al. SummEval: Re-evaluating Summarization Evaluation, 2020, Transactions of the Association for Computational Linguistics.

[10] Jeffrey P. Bigham, et al. Generating SOAP Notes from Doctor-Patient Conversations Using Modular Summarization Techniques, 2020, ACL.

[11] Dina Demner-Fushman, et al. Overview of the MEDIQA 2021 Shared Task on Summarization in the Medical Domain, 2021, BIONLP.

[12] Fei Xia, et al. Summarizing Medical Conversations via Identifying Important Utterances, 2020, COLING.

[13] Dimitra Gkatzia, et al. Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions, 2020, INLG.

[14] Jörg Tiedemann, et al. OPUS-MT – Building open translation services for the World, 2020, EAMT.

[15] Xiaodan Liang, et al. MedDG: A Large-scale Medical Consultation Dataset for Building Medical Dialogue System, 2020, ArXiv.

[16] Xavier Amatriain, et al. Dr. Summarize: Global Summarization of Medical Dialogue by Exploiting Local Structures, 2020, FINDINGS.

[17] Yuanzhe Zhang, et al. MIE: A Medical Information Extractor towards Medical Dialogues, 2020, ACL.

[18] Gagandeep Singh, et al. Generating Medical Reports from Patient-Doctor Conversations Using Sequence-to-Sequence Models, 2020, NLPMC.

[19] Thibault Sellam, et al. BLEURT: Learning Robust Metrics for Text Generation, 2020, ACL.

[20] Peter J. Liu, et al. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization, 2019, ICML.

[21] Omer Levy, et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, 2019, ACL.

[22] Kilian Q. Weinberger, et al. BERTScore: Evaluating Text Generation with BERT, 2019, ICLR.

[23] Aleksander Wawer, et al. SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization, 2019, EMNLP.

[24] Zhongyu Wei, et al. Enhancing Dialogue Symptom Diagnosis with Global Attention and Symptom Graph, 2019, EMNLP.

[25] Nancy F. Chen, et al. Topic-Aware Pointer-Generator Networks for Summarizing Spoken Conversations, 2019, IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[26] Heng Ji, et al. Keep Meeting Summaries on Topic: Abstractive Multi-Modal Meeting Summarization, 2019, ACL.

[27] Shashi Narayan, et al. HighRES: Highlight-based Reference-less Evaluation of Summarization, 2019, ACL.

[28] Ben Goodrich, et al. Assessing The Factual Accuracy of Generated Text, 2019, KDD.

[29] Indika Kahanda, et al. Automatically Generating Psychiatric Case Notes From Digital Transcripts of Doctor-Patient Conversations, 2019, PeerJ Prepr.

[30] Yun-Nung Chen, et al. Abstractive Dialogue Summarization with Sentence-Gated Modeling Optimized by Dialogue Acts, 2018, IEEE Spoken Language Technology Workshop (SLT).

[31] Mirella Lapata, et al. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization, 2018, EMNLP.

[32] Brian G. Arndt, et al. Tethered to the EHR: Primary Care Physician Workload Assessment Using EHR Event Log Data and Time-Motion Observations, 2017, The Annals of Family Medicine.

[33] Phil Blunsom, et al. Teaching Machines to Read and Comprehend, 2015, NIPS.

[34] Stéphane M. Meystre, et al. Evaluating the effects of machine pre-annotation and an interactive annotation interface on manual de-identification of clinical text, 2014, J. Biomed. Informatics.

[35] George Hripcsak, et al. Use of electronic clinical documentation: time spent and team interactions, 2011, J. Am. Medical Informatics Assoc.

[36] Jean Carletta, et al. Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus, 2007, Lang. Resour. Evaluation.

[37] Andreas Stolcke, et al. The ICSI Meeting Corpus, 2003, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03).