A General Machine Learning Framework for Survival Analysis

The modeling of time-to-event data, also known as survival analysis, requires specialized methods that can handle censoring and truncation, accommodate time-varying features and effects, and extend to settings with multiple competing events. However, many machine learning methods for survival analysis only consider the standard setting with right-censored data and the proportional hazards assumption. The methods that do provide extensions usually address at most a subset of these challenges and often require specialized software that cannot be integrated directly into standard machine learning workflows. In this work, we present a general machine learning framework for time-to-event analysis that uses a data augmentation strategy to reduce complex survival tasks to standard Poisson regression tasks. This reformulation is based on well-developed statistical theory. With the proposed approach, any algorithm that can optimize a Poisson (log-)likelihood, such as gradient boosted trees, deep neural networks, model-based boosting, and many more, can be used in the context of time-to-event analysis. The proposed technique does not require any assumptions with respect to the distribution of event times or the functional shapes of feature and interaction effects. Based on the proposed framework, we develop new methods that are competitive with specialized state-of-the-art approaches in terms of accuracy and versatility, while requiring comparatively little programming effort or specialized methodological know-how.
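To make the reduction to Poisson regression concrete, the following is a minimal sketch of one well-known augmentation of this kind, the piecewise exponential construction. It assumes that the follow-up time is split at user-chosen cut points and that the hazard is constant within each interval; the function names, the row layout, and the toy data are hypothetical and are not taken from the paper. Each subject contributes one row per interval in which it is at risk, with a 0/1 event indicator as the Poisson response and the log of the time at risk as an offset.

```python
from collections import defaultdict
from math import log


def to_piecewise_exponential(times, events, cuts):
    """Expand (time, event) pairs into one row per subject and interval at risk.

    `cuts` are increasing interval end points, giving the intervals
    (0, cuts[0]], (cuts[0], cuts[1]], ...  Subjects observed beyond the last
    cut point are treated as censored at that cut point (an assumption of
    this sketch).
    """
    rows = []
    for subject, (t, d) in enumerate(zip(times, events)):
        start = 0.0
        for j, end in enumerate(cuts):
            if t <= start:
                break  # subject left the risk set before this interval
            exposure = min(t, end) - start  # time at risk in interval j
            rows.append({
                "id": subject,
                "interval": j,
                "y": int(d == 1 and t <= end),  # event occurred in interval j
                "exposure": exposure,
                "offset": log(exposure),  # offset for a log-link Poisson model
            })
            start = end
    return rows


def interval_hazards(rows):
    """Poisson maximum likelihood estimate with one intercept per interval.

    Without features, the estimate of the piecewise-constant hazard is
    simply (number of events) / (total time at risk) in each interval.
    """
    n_events = defaultdict(float)
    at_risk = defaultdict(float)
    for r in rows:
        n_events[r["interval"]] += r["y"]
        at_risk[r["interval"]] += r["exposure"]
    return {j: n_events[j] / at_risk[j] for j in sorted(at_risk)}


# Toy data: four subjects, event = 1 means observed event, 0 means censored.
times = [2.0, 5.0, 3.5, 1.0]
events = [1, 0, 1, 1]
cuts = [2.0, 4.0, 6.0]

rows = to_piecewise_exponential(times, events, cuts)
hazards = interval_hazards(rows)
```

In this sketch, `rows` is an ordinary tabular data set. Any learner that minimizes a Poisson loss and accepts an offset (or an equivalent exposure mechanism) could be fit to `y` with `interval` and the subject's features as inputs; the closed-form `interval_hazards` only illustrates the feature-free special case.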
