A comparison of the conditional inference survival forest model to random survival forests based on a simulation study as well as on two applications with time-to-event data

Background: Random survival forest (RSF) models have been identified as alternative methods to the Cox proportional hazards model for analysing time-to-event data. These methods, however, have been criticised for the bias that results from favouring covariates with many split-points, and hence conditional inference forests for time-to-event data have been suggested. Conditional inference forests (CIF) are known to correct the bias in RSF models by separating the procedure for selecting the best covariate to split on from the search for the best split point of the selected covariate.

Methods: In this study, we compare the random survival forest model to the conditional inference forest (CIF) model using twenty-two simulated time-to-event datasets. We also analyse two real time-to-event datasets. The first dataset concerns the survival of children under five years of age in Uganda and consists of categorical covariates, most of which have more than two levels (many split-points). The second dataset concerns the survival of patients with extensively drug-resistant tuberculosis (XDR TB) and consists mainly of categorical covariates with two levels (few split-points).

Results: Based on the bootstrap cross-validated estimates of the integrated Brier score, the conditional inference forest model is superior to random survival forest models in analysing time-to-event data that consist of covariates with many split-points. However, conditional inference forests perform comparably to random survival forest models in analysing time-to-event data consisting of covariates with fewer split-points.

Conclusion: Although survival forests are promising methods for analysing time-to-event data, it is important to identify the best forest model for analysis based on the nature of the covariates in the dataset in question.
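The split-point bias described in the Background can be illustrated with a small simulation. The sketch below is not the study's procedure: it assumes uncensored exponential survival times, treats the many-level covariate as ordered, and uses a standardised difference in mean survival time as a stand-in for the log-rank splitting rule. All function names are hypothetical. Under these assumptions neither covariate is related to the outcome, yet an exhaustive search over all split points selects the covariate with more candidate splits far more often than half of the time.

```python
import random
import statistics


def split_stat(times, left_mask):
    """Standardised |difference in means| between the two child nodes.

    Stand-in for a log-rank statistic (assumption: no censoring).
    """
    left = [t for t, m in zip(times, left_mask) if m]
    right = [t for t, m in zip(times, left_mask) if not m]
    if len(left) < 2 or len(right) < 2:
        return 0.0  # skip splits that leave a child with fewer than 2 cases
    se = (statistics.variance(left) / len(left)
          + statistics.variance(right) / len(right)) ** 0.5
    if se == 0:
        return 0.0
    return abs(statistics.mean(left) - statistics.mean(right)) / se


def best_stat(times, x):
    """Exhaustive search: best statistic over every cut point of x."""
    cuts = sorted(set(x))[:-1]  # the largest value cannot be a cut point
    return max((split_stat(times, [v <= c for v in x]) for c in cuts),
               default=0.0)


def selection_rate(n_reps=300, n=100, levels=10, seed=1):
    """Share of null datasets in which the many-level covariate wins."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_reps):
        times = [rng.expovariate(1.0) for _ in range(n)]  # pure noise outcome
        x_binary = [rng.randrange(2) for _ in range(n)]   # 1 split point
        x_many = [rng.randrange(levels) for _ in range(n)]  # up to 9 split points
        if best_stat(times, x_many) > best_stat(times, x_binary):
            wins += 1
    return wins / n_reps


rate = selection_rate()
print(f"many-level covariate selected in {rate:.0%} of null datasets")
```

An unbiased selection rule would pick each covariate about half of the time here. Conditional inference forests aim for this by first testing each covariate's association with the outcome and only then searching for a split point within the chosen covariate.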
