PARAMO: A PARAllel predictive MOdeling platform for healthcare analytic research using electronic health records

OBJECTIVE: Healthcare analytics research increasingly involves the construction of predictive models for disease targets across varying patient cohorts using electronic health records (EHRs). To facilitate this process, it is critical to support a pipeline of tasks: (1) cohort construction, (2) feature construction, (3) cross-validation, (4) feature selection, and (5) classification. To develop an appropriate model, it is necessary to compare and refine models derived from a diversity of cohorts, patient-specific features, and statistical frameworks. The goal of this work is to develop and evaluate a predictive modeling platform that simplifies and expedites this process for health data.

METHODS: To support this goal, we developed a PARAllel predictive MOdeling (PARAMO) platform which (1) constructs a dependency graph of tasks from specifications of predictive modeling pipelines, (2) schedules the tasks in a topological ordering of the graph, and (3) executes those tasks in parallel. We implemented this platform using Map-Reduce to enable independent tasks to run in parallel in a cluster computing environment. Different task scheduling preferences are also supported.

RESULTS: We assess the performance of PARAMO on various workloads using three datasets: two derived from the EHR systems in place at Geisinger Health System and Vanderbilt University Medical Center, and one from an anonymous longitudinal claims database. We demonstrate significant gains in computational efficiency over a standard sequential approach. In particular, PARAMO can build 800 different models on a 300,000-patient data set in 3 hours when run in parallel, compared to 9 days when run sequentially, a roughly 72-fold reduction in wall-clock time.

CONCLUSION: This work demonstrates that an efficient parallel predictive modeling platform can be developed for EHR data. This platform can facilitate large-scale modeling endeavors, speed up the research workflow, and promote the reuse of health information. It is only a first step and provides the foundation for our ultimate goal of building analytic pipelines that are specialized for health data researchers.
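
The three steps described under METHODS (build a dependency graph from pipeline specifications, order it topologically, run independent tasks in parallel) can be illustrated with a minimal Python sketch. This is not the PARAMO implementation: it substitutes a local thread pool for Map-Reduce, and the names `STAGES`, `build_task_graph`, `run_task`, and `execute`, as well as the idea of sharing tasks between pipelines with identical upstream settings, are assumptions made for illustration only.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# The five pipeline stages named in the abstract, in execution order.
STAGES = ["cohort", "features", "cross_validation",
          "feature_selection", "classification"]


def build_task_graph(pipelines):
    """Return {task: set of prerequisite tasks}.

    Assumption: a task is identified by the stage settings up to and
    including its own stage, so pipelines with the same upstream
    settings share those upstream tasks.
    """
    graph = {}
    for spec in pipelines:
        previous = None
        for i in range(len(STAGES)):
            task = tuple((s, spec[s]) for s in STAGES[: i + 1])
            graph.setdefault(task, set())
            if previous is not None:
                graph[task].add(previous)
            previous = task
    return graph


def run_task(task):
    """Hypothetical placeholder for the real work of one task."""
    stage, setting = task[-1]
    return f"{stage}={setting}"


def execute(graph, max_workers=4):
    """Run tasks in topological order, in parallel where possible."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph is not a DAG
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {}
        while sorter.is_active():
            # Submit every task whose prerequisites have all finished.
            for task in sorter.get_ready():
                pending[pool.submit(run_task, task)] = task
            finished, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in finished:
                task = pending.pop(future)
                results[task] = future.result()
                sorter.done(task)  # unblocks dependent tasks
    return results


if __name__ == "__main__":
    base = {"cohort": "hf", "features": "dx", "cross_validation": "10fold",
            "feature_selection": "chi2"}
    pipelines = [dict(base, classification="logreg"),
                 dict(base, classification="random_forest")]
    graph = build_task_graph(pipelines)
    results = execute(graph)
    print(len(graph), "tasks executed for", len(pipelines), "pipelines")
```

In this sketch the two example pipelines differ only in the classifier, so the four upstream tasks are built once and only the two classification tasks run independently. The settings used (`hf`, `dx`, `10fold`, `chi2`, `logreg`, `random_forest`) are hypothetical labels, not values taken from the paper.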
