Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data

MOTIVATION: An important application of microarray technology is relating gene expression profiles to patients' clinical phenotypes. Success has been demonstrated in the molecular classification of cancer, where gene expression data serve as predictors and cancer type serves as a categorical outcome variable. However, there has been less research on linking gene expression profiles to censored survival data, such as patients' overall survival time or time to cancer relapse. Models for such data should combine good prediction accuracy with parsimony.

RESULTS: We propose using L1 penalized estimation for the Cox model to select genes that are relevant to patients' survival and to build a predictive model for future patients. The computational difficulty of estimation in the high-dimensional, low-sample size setting can be handled efficiently with the recently developed least-angle regression (LARS) method. Our simulation studies, and an application to real datasets on predicting survival after chemotherapy for patients with diffuse large B-cell lymphoma, demonstrate that the proposed procedure, which we call the LARS-Cox procedure, can identify important genes related to time to death due to cancer and can build a parsimonious model for predicting the survival of future patients. The LARS-Cox regression gives better predictive performance than L2 penalized regression and a few other methods based on dimension reduction.

CONCLUSIONS: The proposed LARS-Cox procedure can be very useful for identifying genes relevant to survival phenotypes and for building a parsimonious predictive model that classifies future patients into clinically relevant high- and low-risk groups, based on the gene expression profiles and survival times of previous patients.
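
For readers unfamiliar with the term, L1 penalized estimation for the Cox model can be sketched as follows. This assumes the standard lasso formulation for the Cox partial likelihood with no tied event times. The notation is introduced here for illustration and does not come from the abstract: x_i is the expression profile of patient i, t_i the observed time, delta_i the event indicator (1 if the event was observed, 0 if censored), R(t_i) the set of patients still at risk at time t_i, and s the tuning parameter.

```latex
% Cox partial log-likelihood (assuming no tied event times)
\ell(\beta) = \sum_{i:\,\delta_i = 1}
  \left[ x_i^{\top}\beta
         - \log \sum_{j \in R(t_i)} \exp\!\left( x_j^{\top}\beta \right) \right]

% L1 (lasso) penalized estimate: constrained form
\hat{\beta}(s) = \arg\max_{\beta}\; \ell(\beta)
  \quad \text{subject to} \quad \sum_{k=1}^{p} |\beta_k| \le s
```

Because the constraint bounds the sum of absolute coefficient values, a small s sets many coefficients exactly to zero, which is what produces gene selection and a parsimonious model. An L2 penalty, which bounds the sum of squared coefficients instead, shrinks coefficients toward zero but generally leaves all of them nonzero.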
