Random forests for survival analysis using maximally selected rank statistics

The most popular approach for analyzing survival data is the Cox regression model. The Cox model may, however, be misspecified, and its proportionality assumption is not always fulfilled. An alternative approach is random forests for survival outcomes. The standard split criterion for random survival forests is the log-rank test statistic, which favors splitting variables with many possible split points. Conditional inference forests avoid this split point selection bias. However, current software for conditional inference forests selects the optimal splitting variable with linear rank statistics, which cannot detect non-linear effects in the independent variables. We therefore use maximally selected rank statistics for split point selection in random forests for survival analysis. As in conditional inference forests, p-values for the association between split points and survival time are minimized. We describe several p-value approximations and the implementation of the proposed random forest approach. A simulation study demonstrates that unbiased split point selection is possible. However, there is a trade-off between unbiased split point selection and runtime. In benchmark studies of prediction performance on simulated and real datasets, the new method performs better than random survival forests if informative dichotomous variables are combined with uninformative variables with more categories, and better than conditional inference forests if non-linear covariate effects are included. In a runtime comparison, the method proves to be computationally faster than both alternatives if a simple p-value approximation is used.
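To make the idea of a maximally selected rank statistic concrete, the following is a minimal sketch for a single covariate, not the implementation described in the paper. It assumes log-rank scores computed as the event indicator minus a Nelson-Aalen cumulative hazard (ignoring ties), standardizes the score sum of the left child with its permutation mean and variance, and takes the maximum over candidate cutpoints. The function names and the `minprop` parameter are hypothetical and chosen for illustration only; the p-value approximations mentioned above are not included.

```python
import numpy as np

def logrank_scores(time, status):
    """Log-rank scores (assumed form): event indicator minus Nelson-Aalen hazard."""
    time = np.asarray(time, dtype=float)
    status = np.asarray(status, dtype=float)
    n = len(time)
    order = np.argsort(time, kind="stable")
    at_risk = n - np.arange(n)  # subjects still at risk at each ordered time (ties ignored)
    cum_hazard = np.cumsum(status[order] / at_risk)
    scores = np.empty(n)
    scores[order] = status[order] - cum_hazard
    return scores

def max_selected_rank_statistic(x, time, status, minprop=0.1):
    """Return (max standardized statistic, cutpoint) over splits x <= cut."""
    x = np.asarray(x, dtype=float)
    a = logrank_scores(time, status)
    n = len(x)
    mean_a = a.mean()
    ss_a = np.sum((a - mean_a) ** 2)
    best_stat, best_cut = 0.0, None
    # Every unique value except the largest defines a non-trivial split.
    for cut in np.unique(x)[:-1]:
        left = x <= cut
        m = int(left.sum())
        # Hypothetical constraint: skip splits with a very small child node.
        if m < minprop * n or m > (1 - minprop) * n:
            continue
        expected = m * mean_a
        variance = m * (n - m) / (n * (n - 1)) * ss_a  # permutation variance
        stat = abs(a[left].sum() - expected) / np.sqrt(variance)
        if stat > best_stat:
            best_stat, best_cut = stat, cut
    return best_stat, best_cut

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.uniform(size=300)
    # Threshold (non-linear) effect: the hazard jumps for x > 0.5.
    time = rng.exponential(scale=np.where(x > 0.5, 0.2, 1.0))
    status = np.ones(300)
    print(max_selected_rank_statistic(x, time, status))
```

In a forest, a statistic of this kind would be computed for each candidate variable in a node, converted to a p-value that accounts for the maximization over cutpoints, and the variable with the smallest p-value would be split at its best cutpoint.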
