Review of statistical methods for survival analysis using genomic data

Survival analysis deals with time-to-event data, where the event may be death, onset of disease, or bankruptcy. A defining characteristic of survival data is censoring: for a “censored” observation the time to event is not fully observed, and the recorded time is only a lower bound on it. For each subject, only the event time or the censoring time is observed. Many traditional statistical methods have been used effectively to analyze survival data with censored observations. However, with the development of high-throughput technologies that produce “omics” data, more advanced statistical methods, such as regularization, are required to construct predictive survival models from high-dimensional genomic data. Furthermore, machine learning approaches have been adapted to survival analysis to fit nonlinear and complex interaction effects between predictors and to achieve more accurate prediction of individual survival probabilities. Since most clinicians and medical researchers now have easy access to statistical programs for analyzing survival data, a review article is helpful for understanding the statistical methods used in survival analysis. We review traditional survival methods and regularization methods with various penalty functions for the analysis of high-dimensional genomic data, and describe machine learning techniques that have been adapted to survival analysis.
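
To make the notion of censoring concrete, the following is a minimal sketch of how right-censored data are commonly represented (an observed time plus an event indicator) and how a traditional nonparametric method, the Kaplan-Meier product-limit estimator, uses both. The function name `kaplan_meier` and the toy data are illustrative assumptions, not taken from the article, and the code is a plain-Python sketch rather than a reference implementation.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Kaplan-Meier estimate of the survival function.

    times  : observed time per subject (event time or censoring time)
    events : 1 if the event was observed at that time, 0 if censored
    Returns a list of (t, S(t)) pairs, one per distinct observed event time.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")

    data = sorted(zip(times, events))   # process subjects in time order
    n_at_risk = len(data)               # subjects still under observation
    survival = 1.0
    curve = []

    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0    # events observed exactly at time t
        n_leaving = 0   # subjects leaving the risk set at t (events + censored)
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            # Product-limit step: multiply by the conditional survival at t.
            # Subjects censored at t still count as at risk at t.
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_leaving
    return curve


# Toy data (hypothetical): five subjects, two of them censored (event = 0).
toy_times = [2, 3, 3, 5, 8]
toy_events = [1, 0, 1, 1, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
for t, s in toy_curve:
    print(f"t = {t}: S(t) = {s:.3f}")
```

Censored subjects never trigger a drop in the curve, but they do contribute to the size of the risk set up to their censoring time, which is how the lower-bound information they carry enters the estimate.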
