The reporting quality of natural language processing studies: systematic review of studies of radiology reports

Background: Automated analysis of radiology reports using natural language processing (NLP) can provide valuable information on patients' health and disease. Given the rapid development of the field, NLP studies should report their methodology transparently to allow comparison of approaches and reproducibility. This systematic review summarises the characteristics and reporting quality of studies applying NLP to radiology reports.

Methods: We searched Google Scholar for studies published in English between January 2015 and October 2019 that applied NLP to radiology reports of any imaging modality. At least two reviewers independently performed screening and data extraction. We specified 15 quality-assessment criteria relating to data source, datasets, ground truth, outcomes, and reproducibility. The primary NLP performance measures were precision, recall, and F1 score (standard definitions are sketched below).

Results: Of the 4,836 records retrieved, we included 164 studies that applied NLP to radiology reports. The most common clinical applications were disease information or classification (28%) and diagnostic surveillance (27.4%). Most studies used English-language radiology reports (86%), and 28% used reports from mixed imaging modalities. Oncology (24%) was the most frequent disease area. Most studies used a dataset of more than 200 reports (85.4%), but the proportions of studies that described their annotated, training, validation, and test sets were 67.1%, 63.4%, 45.7%, and 67.7%, respectively. About half of the studies reported precision (48.8%) and recall (53.7%). Few studies reported external validation (10.8%), data availability (8.5%), or code availability (9.1%). No pattern of performance was associated with overall reporting quality.

Conclusions: NLP of radiology reports has a range of potential clinical applications in health services and research. However, we found suboptimal reporting quality that precludes comparison, reproducibility, and replication. Our results support the development of reporting standards specific to clinical NLP studies.
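For reference, the three primary performance measures named in the Methods have standard textbook definitions for a binary labelling task, expressed in terms of true-positive (TP), false-positive (FP), and false-negative (FN) counts. The following is a minimal LaTeX sketch of those standard formulas, not notation taken from the reviewed studies:

```latex
% Standard definitions of precision, recall, and F1 for binary classification,
% where TP, FP, and FN are the true-positive, false-positive, and
% false-negative counts. F1 is the harmonic mean of precision and recall.
\[
  \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
  \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
  F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]
```

Precision penalises false positives, recall penalises false negatives, and F1 balances the two, which is why NLP studies are typically expected to report all three together.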
