Boosted Trees for Risk Prognosis

We present a new approach to ensemble learning for risk prognosis in heterogeneous medical populations. Our aim is to improve overall prognosis by focusing on under-represented patient subgroups with an atypical disease presentation; current prognostic tools consistently mis-estimate risk for these subgroups. Our method proceeds sequentially, learning nonparametric survival estimators that iteratively improve predictions for previously misdiagnosed patients, a process called boosting. The resulting survival estimates are fully nonparametric, that is, constrained neither by assumptions regarding the baseline hazard nor by assumptions regarding the underlying covariate interactions, which differentiates our approach from existing boosting methods for survival analysis. In addition, our approach yields a measure of relative covariate importance that accurately identifies relevant covariates within complex survival dynamics, thereby informing further medical understanding of disease interactions. We study the properties of our approach on a variety of heterogeneous medical datasets, demonstrating significant performance improvements over existing survival and ensemble methods.
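The sequential reweighting idea described above can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes one-split trees ("stumps") whose leaves hold weighted Kaplan-Meier estimates, a single prediction horizon `tau`, and an AdaBoost.R2-style weight update. All function names, the synthetic data, and the choice of zero loss for patients censored before `tau` are illustrative assumptions.

```python
import numpy as np

def weighted_km(times, events, weights, tau):
    """Weighted Kaplan-Meier estimate of S(tau); no hazard model is assumed."""
    surv = 1.0
    for t in np.unique(times[(events == 1) & (times <= tau)]):
        at_risk = weights[times >= t].sum()
        died = weights[(times == t) & (events == 1)].sum()
        surv *= 1.0 - died / at_risk
    return surv

def fit_stump(X, times, events, weights, tau, y, known):
    """One-split survival tree: keep the split with the lowest weighted error."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:  # dropping the max keeps both leaves non-empty
            left = X[:, j] <= thr
            s_left = weighted_km(times[left], events[left], weights[left], tau)
            s_right = weighted_km(times[~left], events[~left], weights[~left], tau)
            pred = np.where(left, s_left, s_right)
            err = np.sum(weights[known] * np.abs(y - pred)[known])
            if err < best[0]:
                best = (err, (j, thr, s_left, s_right))
    return best[1]

def predict_stump(stump, X):
    j, thr, s_left, s_right = stump
    return np.where(X[:, j] <= thr, s_left, s_right)

def boost(X, times, events, tau, rounds=5):
    """AdaBoost.R2-style loop: poorly predicted patients gain relative weight."""
    w = np.full(len(times), 1.0 / len(times))
    known = (times > tau) | (events == 1)  # survival status at tau is observed
    y = (times > tau).astype(float)        # 1 if still event-free at tau
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, times, events, w, tau, y, known)
        # Simplifying assumption: patients censored before tau get zero loss.
        loss = np.where(known, np.abs(y - predict_stump(stump, X)), 0.0)
        if loss.max() == 0.0:
            ensemble.append((stump, 1.0))
            break
        loss = loss / loss.max()
        avg = np.sum(w * loss)
        if avg >= 0.5:                     # weak learner too weak: stop
            if not ensemble:
                ensemble.append((stump, 1.0))
            break
        beta = avg / (1.0 - avg)
        ensemble.append((stump, np.log(1.0 / beta)))
        w = w * beta ** (1.0 - loss)       # well-predicted patients shrink most
        w = w / w.sum()
    return ensemble, w

def predict(ensemble, X):
    """Confidence-weighted average of the stumps' survival estimates at tau."""
    total = sum(alpha for _, alpha in ensemble)
    return sum(alpha * predict_stump(s, X) for s, alpha in ensemble) / total

# Synthetic cohort (purely illustrative, not from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
times = rng.exponential(scale=np.exp(X[:, 0]))
events = (rng.random(40) < 0.7).astype(int)
ensemble, w = boost(X, times, events, tau=1.0)
risk = 1.0 - predict(ensemble, X)          # estimated event probability by tau
```

The averaging in `predict` is a simplification chosen for brevity; AdaBoost.R2 as usually stated combines learners with a weighted median instead.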
