Automatic prediction of coronary artery disease from clinical narratives

Coronary artery disease (CAD) is not only the most common form of heart disease but also the leading cause of death in both men and women (Coronary Artery Disease: MedlinePlus, 2015). We present a system that automatically predicts whether patients will develop CAD from their narrative medical histories, i.e., clinical free text. Although several studies have used the free text in medical records to identify risk factors for CAD, to the best of our knowledge our work is the first attempt to automatically predict the development of CAD itself. We tackle this task on a small corpus of diabetic patients. Because the corpus is small, the number of features must be limited to avoid overfitting. We propose an ontology-guided approach to feature extraction and compare it with two classic feature selection techniques. Our system achieves a state-of-the-art F1 score of 77.4%.
