A probabilistic topic model for clinical risk stratification from electronic health records

BACKGROUND AND OBJECTIVE Risk stratification aims to give physicians an accurate assessment of a patient's clinical risk so that an individualized prevention or management strategy can be developed and delivered. Existing risk stratification techniques mainly predict the overall risk of an individual patient in a supervised manner and, at the cohort level, often offer little insight beyond a flat score-based segmentation of the labeled clinical dataset. In this paper, we propose a new approach to risk stratification that explores a large volume of electronic health records (EHRs) in an unsupervised fashion.

METHODS We propose a novel probabilistic topic modeling framework, the probabilistic risk stratification model (PRSM), based on Latent Dirichlet Allocation (LDA). PRSM represents a patient's clinical state as a probabilistic combination of latent sub-profiles and generates sub-profile-specific risk tiers of patients from their EHRs in a fully unsupervised fashion. The resulting strata can be readily identified as high-, medium- and low-risk. We also present an extension of PRSM, weakly supervised PRSM (WS-PRSM), which incorporates minimal prior information into the model to improve risk stratification accuracy and to make our models highly portable to risk stratification tasks for various diseases.

RESULTS We evaluate the proposed approach on a clinical dataset containing 3463 coronary heart disease (CHD) patient instances. Both PRSM and WS-PRSM were compared with two established supervised risk stratification algorithms, logistic regression and support vector machine, and proved effective for CHD risk stratification as measured by the area under the receiver operating characteristic curve (AUC). In addition, WS-PRSM achieves a performance gain of over 2% compared with PRSM on the experimental dataset, demonstrating that incorporating risk scoring knowledge as prior information can improve risk stratification performance.

CONCLUSIONS Experimental results show that our models achieve competitive performance in risk stratification compared with existing supervised approaches. In addition, the unsupervised nature of our models makes them highly portable to risk stratification tasks for various diseases. Moreover, the patient sub-profiles and sub-profile-specific risk tiers generated by our models are coherent and informative, and offer significant potential for further tasks such as patient cohort analysis. We hypothesize that the proposed framework can readily meet the demand for risk stratification from a large volume of EHRs in an open-ended fashion.
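
Since the comparison above is reported in terms of AUC, a minimal sketch of how that metric can be computed from risk scores may help. It uses the standard pairwise definition: the probability that a randomly chosen patient with the outcome receives a higher score than a randomly chosen patient without it, with ties counted as one half. The function name `auc` and the toy scores and labels are illustrative assumptions, not the paper's implementation or data.

```python
def auc(scores, labels):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    scores: predicted risk scores, higher meaning higher risk.
    labels: 1 if the patient had the outcome, 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: six patients, three with the outcome.
risk_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
had_event = [1, 1, 1, 0, 0, 0]
print(auc(risk_scores, had_event))
```

In this toy example, eight of the nine positive-negative pairs are ranked correctly, so the AUC is 8/9. The quadratic loop is fine for illustration; for a cohort of thousands of patients a rank-based computation or an established library routine would be the more practical choice.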
