Balanced Random Survival Forests for Extremely Unbalanced, Right Censored Data

The accuracy of survival models for life-expectancy prediction and critical-care applications is significantly compromised by the sparsity of samples and by the extreme imbalance between the sizes of the survival class (usually the majority) and the mortality class. While the recent random survival forest (RSF) model overcomes the limitations of the proportional hazards assumption, imbalance in the data causes it to underestimate the hazard of the mortality class and overestimate the hazard of the survival class. To address this gap, a balanced random survival forests (BRSF) model is presented, which trains the RSF model on data generated by a synthetic minority sampling scheme. Theoretical results on the effect of balancing on the prediction accuracy of BRSF are reported. Benchmarking studies were conducted using five datasets with different levels of class imbalance from public repositories and an imbalanced dataset of 267 acute cardiac patients collected at the Heart, Artery, and Vein Center of Fresno, CA. The investigations suggest that BRSF discriminates better between the survival and mortality classes. It outperformed both optimized Cox models (with and without balancing) and RSF, with an average reduction of 55% in prediction error over the next best alternative.
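
To make the balancing step concrete, the following is a minimal sketch of a SMOTE-style oversampler for right-censored data. It is not the authors' implementation. It assumes that the mortality class consists of the samples with an observed event, that synthetic samples interpolate the observed time together with the covariates, and that neighbours are found by Euclidean distance. The function name `smote_minority` and its parameters `k` and `seed` are hypothetical.

```python
import numpy as np

def smote_minority(X, time, event, k=5, seed=0):
    """Oversample the event (mortality) class until both classes are equal in size."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)

    # Assumption: covariates and observed time are interpolated jointly.
    Z = np.column_stack([X, time])
    Z_min = Z[event]
    n_needed = int((~event).sum() - event.sum())
    if n_needed <= 0 or len(Z_min) < 2:
        return X, time, event  # nothing to balance, or too few minority samples

    k = min(k, len(Z_min) - 1)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(Z_min[:, None, :] - Z_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a sample is not its own neighbour
    neighbors = np.argsort(d, axis=1)[:, :k]

    # Pick a random minority sample and one of its k nearest minority neighbours,
    # then place a synthetic point at a random position on the segment between them.
    base = rng.integers(0, len(Z_min), size=n_needed)
    nb = neighbors[base, rng.integers(0, k, size=n_needed)]
    gap = rng.random((n_needed, 1))
    Z_new = Z_min[base] + gap * (Z_min[nb] - Z_min[base])

    X_bal = np.vstack([X, Z_new[:, :-1]])
    time_bal = np.concatenate([time, Z_new[:, -1]])
    event_bal = np.concatenate([event, np.ones(n_needed, dtype=bool)])
    return X_bal, time_bal, event_bal

# Toy data: 100 samples, 10 observed events (the minority class).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
time = rng.exponential(10.0, size=100)
event = np.zeros(100, dtype=bool)
event[:10] = True

X_bal, time_bal, event_bal = smote_minority(X, time, event)
print(X_bal.shape, int(event_bal.sum()), int((~event_bal).sum()))
```

Under these assumptions, the balanced arrays would then be used to train a random survival forest in place of the original data. How censoring and event times should be treated for synthetic samples is a modelling choice that this sketch does not settle.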
