Selective prediction for extracting unstructured clinical data

Background: Electronic health records (EHRs) are a large data source for outcomes research, but most EHR data are unstructured (e.g., free-text clinical notes) and not readily amenable to computational methods. Existing approaches to unstructured data, such as manual abstraction, structured proxy variables, and model-assisted abstraction, are time-consuming, scale poorly, and require clinical domain expertise. This paper examines whether selective prediction, which gives a model the option to abstain from generating a prediction, can improve the accuracy and efficiency of unstructured clinical data abstraction.
Methods: We trained selective prediction models to identify the presence of four distinct clinical variables in free-text pathology reports: primary cancer diagnosis of glioblastoma (GBM, n = 659), resection of rectal adenocarcinoma (RRA, n = 601), and two specific procedures for that resection, abdominoperineal resection (APR, n = 601) and low anterior resection (LAR, n = 601). Data were manually abstracted from pathology reports and used to train L1-regularized logistic regression models on term frequency-inverse document frequency (TF-IDF) features. Data points that a model was unable to predict with high certainty were manually abstracted.
Findings: All four selective prediction models achieved test-set sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) above 0.91. Selective prediction led to sizable gains in automation, reducing manual chart abstraction by 57% to 95% across the four outcomes. For the GBM classifier, selective prediction improved sensitivity (0.94 to 0.96), specificity (0.79 to 0.96), PPV (0.89 to 0.98), and NPV (0.88 to 0.91) compared with a non-selective classifier.
Interpretation: Selective prediction using utility-based probability thresholds can facilitate unstructured data extraction by giving "easy" charts to a model and "hard" charts to human abstractors, thus increasing efficiency while maintaining or improving accuracy.
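
The routing of "easy" and "hard" charts can be illustrated with a minimal Python sketch. It assumes a classifier has already produced a probability for each chart and applies two fixed cutoffs; the function names, the cutoff values (0.10 and 0.90), and the toy probabilities are hypothetical and are not taken from the study, which derives its thresholds from utilities rather than fixing them by hand.

```python
from typing import Dict, List, Optional, Sequence


def selective_predict(
    probs: Sequence[float], lower: float, upper: float
) -> List[Optional[int]]:
    """Return 1 or 0 when the model is confident, None to defer to a human."""
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= lower <= upper <= 1")
    decisions: List[Optional[int]] = []
    for p in probs:
        if p >= upper:
            decisions.append(1)      # confident positive
        elif p <= lower:
            decisions.append(0)      # confident negative
        else:
            decisions.append(None)   # abstain: route to manual abstraction
    return decisions


def summarize(
    decisions: Sequence[Optional[int]], labels: Sequence[int]
) -> Dict[str, Optional[float]]:
    """Compute the automated fraction and accuracy metrics on automated charts only."""
    tp = fp = tn = fn = abstained = 0
    for d, y in zip(decisions, labels):
        if d is None:
            abstained += 1
        elif d == 1:
            tp, fp = tp + (y == 1), fp + (y == 0)
        else:
            tn, fn = tn + (y == 0), fn + (y == 1)

    def ratio(num: int, den: int) -> Optional[float]:
        return num / den if den else None  # undefined when the denominator is empty

    total = len(decisions)
    return {
        "automated_fraction": ratio(total - abstained, total),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Toy data (hypothetical): predicted probabilities and manually abstracted labels.
example_probs = [0.98, 0.91, 0.55, 0.40, 0.07, 0.02, 0.65, 0.03]
example_labels = [1, 1, 0, 1, 0, 0, 1, 0]
example_decisions = selective_predict(example_probs, lower=0.10, upper=0.90)
print(summarize(example_decisions, example_labels))
```

Widening the gap between the two cutoffs sends more charts to human abstractors and typically raises accuracy on the automated remainder; narrowing it does the opposite. Choosing the cutoffs is therefore a trade-off between automation and error cost.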
