Ontology-based venous thromboembolism risk assessment model developing from medical records

The Padua linear model is widely used for risk assessment of venous thromboembolism (VTE), a common but preventable complication in inpatients. However, genetic and environmental differences between Western and Chinese populations limit the validity of the Padua model in Chinese patients. Medical records, which contain rich information about disease progression, are useful for mining new risk factors relevant to Chinese VTE patients. Furthermore, machine learning (ML) methods provide new opportunities to build precise risk prediction models by automatically selecting risk factors from the original medical records.

Medical records of 3,106 inpatients, including 224 VTE patients, were collected, and various types of ontologies were integrated to parse the unstructured text. A workflow for an ontology-based VTE risk prediction model that combines natural language processing (NLP) and ML technologies was proposed. First, ontology terms were extracted from the medical records and sorted according to their calculated weights. Next, the importance of each term was evaluated at the section level, and finally an ML model was built on a subset of terms. Four ML methods were tested, and the best model was chosen by comparing the area under the receiver operating characteristic curve (AUROC).

Medical records were first split into sections, and the terms from each section were then sorted by weights calculated from multiple types of information. A greedy selection algorithm was used to identify significant sections and terms. The top terms in each section were selected to construct distributed representations of patients using a word embedding technique. Using the top 300 terms of two important sections, namely the ‘Progress Note’ section and the ‘Admitting Diagnosis’ section, the model showed relatively better predictive performance. The ML model that used a subset of about 110 terms from these two sections achieved the best AUROC, 0.973 ± 0.006, significantly better than the Padua model's 0.791 ± 0.022. The terms found by the model showed potential to help clinicians explore new risk factors.

In this study, a new VTE risk assessment model based on ontology term extraction from raw medical records was developed, and its performance was verified on a real clinical dataset. The selected terms can help clinicians discover meaningful risk factors.
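
The greedy selection of terms by AUROC described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the classifier or the scoring used inside the selection loop, so the sketch assumes a deliberately simple patient score (the number of selected terms present in a record). The function names, term names, and toy records are all hypothetical.

```python
from typing import List, Sequence, Set


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def greedy_select(records: List[Set[str]], labels: List[int],
                  candidates: List[str], max_terms: int) -> List[str]:
    """Forward selection: repeatedly add the term that most improves AUROC."""
    selected: List[str] = []
    best = 0.5  # AUROC of an uninformative score
    while len(selected) < max_terms:
        trials = {}
        for term in candidates:
            if term in selected:
                continue
            chosen = set(selected) | {term}
            # Assumed toy score: how many chosen terms appear in the record.
            scores = [len(rec & chosen) for rec in records]
            trials[term] = auroc(scores, labels)
        if not trials:
            break
        term, value = max(trials.items(), key=lambda kv: kv[1])
        if value <= best:  # stop when no candidate improves AUROC
            break
        selected.append(term)
        best = value
    return selected


# Hypothetical toy data: each record is the set of terms found in it.
records = [
    {"immobility", "malignancy"},
    {"immobility", "fever"},
    {"malignancy", "cough"},
    {"fever", "cough"},
    {"cough"},
    {"fever"},
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = VTE, 0 = no VTE
candidates = ["cough", "fever", "immobility", "malignancy"]

selected = greedy_select(records, labels, candidates, max_terms=3)
print(selected)
```

In a realistic setting the toy score would be replaced by a cross-validated ML model trained on the selected terms, and the stopping rule would need to guard against overfitting on a small positive class.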
