A Robust Framework for Accelerated Outcome-driven Risk Factor Identification from EHR

Electronic Health Records (EHR) containing longitudinal information about millions of patient lives are increasingly used by organizations across the healthcare spectrum. Studies on EHR data have enabled real-world applications such as understanding disease progression, outcomes analysis, and comparative effectiveness research. However, each study is often commissioned independently, with data gathered through surveys or purchased specifically for that study through a long and often painful process. This is followed by an arduous, repetitive cycle of analysis, model building, and insight generation. The whole process can take anywhere from 1 to 3 years. In this paper, we present a robust end-to-end machine-learning-based SaaS system for analysis on a very large EHR dataset. The framework consists of a proprietary EHR datamart spanning ~55 million patient lives in the USA and more than 20 billion data points. To the best of our knowledge, this framework is the largest in the industry to analyze medical records at this scale, with such efficacy and ease. We developed an end-to-end ML framework with carefully chosen components to support EHR analysis at scale and to suit further downstream clinical analysis. Specifically, it consists of a ridge-regularized Survival Support Vector Machine (SSVM) with a clinical kernel, coupled with Chi-square distance-based feature selection, to uncover relevant risk factors by exploiting the weak correlations in EHR. Our results on multiple real use cases indicate that the framework identifies relevant factors effectively without expert supervision. The framework is stable, generalizes across outcomes, and was also found to contribute to better out-of-bound prediction than known expert features. Importantly, the ML methodologies used are interpretable, which is critical for acceptance of our system in the targeted user base.
With the system operational, all of these studies were completed within 3-4 weeks, compared to the industry standard of 12-36 months. As such, our system can accelerate analysis and discovery and deliver better ROI through reduced investment and quicker turnaround of studies.
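
To make the "clinical kernel" mentioned above concrete, the following is a minimal NumPy sketch of the commonly described form of that kernel: continuous and ordinal variables contribute (r - |x - z|) / r, where r is the variable's range; nominal variables contribute 1 on a match and 0 otherwise; and the per-variable similarities are averaged. The function name `clinical_kernel`, its arguments, and the toy patient matrix are illustrative assumptions, not the authors' implementation, and the feature-selection step is not shown.

```python
import numpy as np


def clinical_kernel(X, Z, is_nominal, ranges):
    """Average per-variable similarity between rows of X and rows of Z.

    X, Z        : 2-D arrays of shape (n_x, d) and (n_z, d)
    is_nominal  : length-d booleans, True for unordered categorical columns
    ranges      : length-d value ranges (max - min) for non-nominal columns;
                  entries for nominal columns are ignored
    All names here are hypothetical and chosen for illustration.
    """
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    is_nominal = np.asarray(is_nominal, dtype=bool)
    ranges = np.asarray(ranges, dtype=float)

    n_features = X.shape[1]
    K = np.zeros((X.shape[0], Z.shape[0]))
    for j in range(n_features):
        xj = X[:, j][:, None]  # column vector, shape (n_x, 1)
        zj = Z[:, j][None, :]  # row vector, shape (1, n_z)
        if is_nominal[j]:
            # Nominal variable: similarity is 1 only on an exact match.
            K += (xj == zj).astype(float)
        else:
            # Continuous/ordinal variable: scaled absolute difference.
            K += (ranges[j] - np.abs(xj - zj)) / ranges[j]
    return K / n_features


# Toy data (made up): columns are age in years, ordinal stage 1-4, coded sex.
patients = np.array([
    [40.0, 1.0, 0.0],
    [65.0, 3.0, 1.0],
    [90.0, 4.0, 0.0],
])
is_nominal = [False, False, True]
ranges = [50.0, 3.0, 1.0]  # age spans 40-90, stage spans 1-4; last entry unused

K = clinical_kernel(patients, patients, is_nominal, ranges)
print(np.round(K, 3))
```

A matrix built this way could be passed to any survival SVM that accepts a precomputed kernel; scikit-survival, for example, provides `FastKernelSurvivalSVM` with `kernel="precomputed"` as well as its own `sksurv.kernels.clinical_kernel`. Whether the system described here uses that library is not stated in the abstract.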
