Microarray gene expression data with linked survival phenotypes: diffuse large-B-cell lymphoma revisited.

Diffuse large-B-cell lymphoma (DLBCL) is an aggressive malignancy of mature B lymphocytes and is the most common type of lymphoma in adults. While treatment advances have been substantial in what was formerly a fatal disease, fewer than 50% of patients achieve lasting remission. In an effort to predict treatment success and explain disease heterogeneity, clinical features have been employed for prognostic purposes but have yielded only modest predictive performance. This has spawned a series of high-profile microarray-based gene expression studies of DLBCL, in the hope that molecular-level information could be used to refine prognosis. This paper reevaluates these microarray-based prognostic assessments and extends the statistical methodology that has been used in this context. Methodological challenges arise in using patients' gene expression profiles to predict survival endpoints on account of the large number of genes and their complex interdependence. We initially focus on the Lymphochip data and analysis of Rosenwald et al. (2002). After describing relationships between the analyses performed and gene harvesting (Hastie et al., 2001a), we argue for the utility of penalized approaches, in particular least angle regression and the least absolute shrinkage and selection operator (LARS-lasso; Efron et al., 2004). While these techniques have been extended to the proportional hazards/partial likelihood framework, the resultant algorithms are computationally burdensome. We develop residual-based approximations that eliminate this burden yet perform similarly. Comparisons of predictive accuracy across both methods and studies are made using time-dependent receiver operating characteristic curves. These indicate that gene expression data, in turn, deliver only modest predictions of post-therapy DLBCL survival. We conclude by outlining possibilities for further work.
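The residual-based idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, and the abstract does not say which residual is used: here we assume martingale residuals from a covariate-free Cox model (event indicator minus the Nelson-Aalen cumulative hazard) as a continuous working response, then fit an ordinary lasso to them by coordinate descent. The function names, the toy data, and the penalty value are all hypothetical.

```python
def null_martingale_residuals(times, events):
    """Martingale residuals of a covariate-free Cox model (an assumed choice).

    r_i = delta_i - H(t_i), where H is the Nelson-Aalen cumulative hazard.
    """
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    resid = [0.0] * n
    cum_hazard = 0.0
    at_risk = n
    k = 0
    while k < n:
        # Group tied observation times so they share one hazard increment.
        j = k
        deaths = 0
        while j < n and times[order[j]] == times[order[k]]:
            deaths += events[order[j]]
            j += 1
        cum_hazard += deaths / at_risk
        for m in range(k, j):
            i = order[m]
            resid[i] = events[i] - cum_hazard
        at_risk -= j - k
        k = j
    return resid


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    Assumes the columns of X are centred and y has mean zero (no intercept).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # current residual y - X b
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # Covariance of column j with the partial residual (column j removed).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Hypothetical toy data: 8 patients, 2 centred "genes"; event 1 = death, 0 = censored.
times = [2, 3, 5, 7, 8, 11, 13, 17]
events = [1, 1, 1, 0, 1, 0, 1, 0]
gene0 = [2.0, 1.5, 1.0, 0.2, 0.5, -1.0, -1.7, -2.5]
gene1 = [0.3, -0.4, 0.1, -0.2, 0.5, -0.1, 0.2, -0.4]
X = list(zip(gene0, gene1))

resid = null_martingale_residuals(times, events)
beta = lasso_coordinate_descent(X, resid, lam=0.05)
print(beta)
```

Because the residuals are computed once, the penalized fit reduces to a standard least-squares lasso path rather than repeated partial-likelihood optimisation, which is the source of the computational saving this sketch is meant to convey. In practice the penalty would be chosen by cross-validation rather than fixed.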

[1]  D. Cox Regression Models and Life-Tables , 1972 .

[2]  N. Nagelkerke,et al.  A note on a general definition of the coefficient of determination , 1991 .

[3]  J. Friedman,et al.  Multivariate adaptive regression splines , 1991 .

[4]  M. LeBlanc,et al.  Relative risk trees for censored survival data. , 1992, Biometrics.

[5]  I. Johnstone,et al.  Ideal spatial adaptation by wavelet shrinkage , 1994 .

[6]  M. Akritas Nearest Neighbor Estimation of a Bivariate Distribution Under Random Censoring , 1994 .

[7]  Statistical Issues in the Evaluation of Markers of HIV Progression , 1995 .

[8]  T M Therneau,et al.  Diagnostic plots to reveal functional form for covariates in multiplicative intensity models. , 1995, Biometrics.

[9]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[10]  R. Tibshirani The lasso method for variable selection in the Cox model. , 1997, Statistics in medicine.

[11]  Jianming Ye On Measuring and Correcting the Effects of Data Mining and Model Selection , 1998 .

[12]  D. Botstein,et al.  Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[13]  M LeBlanc,et al.  Adaptive Regression Splines in the Cox Model , 1999, Biometrics.

[14]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[15]  R. Tibshirani,et al.  The Covariance Inflation Criterion for Adaptive Model Selection , 1999 .

[16]  Nello Cristianini,et al.  An introduction to Support Vector Machines , 2000 .

[17]  T. Lumley,et al.  Time‐Dependent ROC Curves for Censored Survival Data and a Diagnostic Marker , 2000, Biometrics.

[18]  M. R. Osborne,et al.  On the LASSO and its Dual , 2000 .

[19]  Ash A. Alizadeh,et al.  Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling , 2000, Nature.

[20]  Robert Tibshirani,et al.  Estimating the number of clusters in a data set via the gap statistic , 2000 .

[21]  R. Tibshirani,et al.  Supervised harvesting of expression trees , 2001, Genome Biology.

[22]  P. Grambsch,et al.  Modeling Survival Data: Extending the Cox Model , 2000 .

[23]  R. Tibshirani,et al.  Significance analysis of microarrays applied to the ionizing radiation response , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[24]  Trevor Hastie,et al.  The Elements of Statistical Learning , 2001 .

[25]  L. Staudt,et al.  Signatures of the immune response. , 2001, Immunity.

[26]  Adrian E. Raftery,et al.  Model-based clustering and data transformations for gene expression data , 2001, Bioinform..

[27]  R. Spang,et al.  Predicting the clinical status of human breast cancer by using gene expression profiles , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[28]  Todd R. Golub,et al.  Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning , 2002, Nature Medicine.

[29]  S. Dudoit,et al.  Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data , 2002 .

[30]  A. Rosenwald,et al.  The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. , 2002, The New England journal of medicine.

[31]  R. Tibshirani,et al.  Diagnosis of multiple cancer types by shrunken centroids of gene expression , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[32]  S. Dudoit,et al.  A prediction-based resampling method for estimating the number of clusters in a dataset , 2002, Genome Biology.

[33]  S. Keleş,et al.  Residual‐based tree‐structured survival analysis , 2002, Statistics in medicine.

[34]  T. Speed,et al.  Statistical issues in cDNA microarray data analysis. , 2003, Methods in molecular biology.

[35]  Guoying Liu,et al.  NetAffx: Affymetrix probesets and annotations , 2003, Nucleic Acids Res..

[36]  M. Radmacher,et al.  Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. , 2003, Journal of the National Cancer Institute.

[37]  D. Edwards,et al.  Statistical Analysis of Gene Expression Microarray Data , 2003 .

[38]  Kam D. Dahlquist,et al.  Regression Approaches for Microarray Data Analysis , 2002, J. Comput. Biol..

[39]  Adrian Wiestner,et al.  A gene expression-based method to diagnose clinically distinct subgroups of diffuse large B cell lymphoma , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[40]  G. Parmigiani,et al.  The Analysis of Gene Expression Data , 2003 .

[41]  Eric R. Ziegel,et al.  The Elements of Statistical Learning , 2003, Technometrics.

[42]  T. Hastie,et al.  Classification of gene microarrays by penalized logistic regression. , 2004, Biostatistics.

[43]  R. Tibshirani,et al.  Least angle regression , 2004, math/0406456.

[44]  M. West,et al.  Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[45]  R. Tibshirani,et al.  Semi-Supervised Methods to Predict Patient Survival from Gene Expression Data , 2004, PLoS biology.

[46]  Jiang Gui,et al.  Partial Cox regression analysis for high-dimensional microarray gene expression data , 2004, ISMB/ECCB.

[47]  Ash A. Alizadeh,et al.  Prediction of survival in diffuse large-B-cell lymphoma based on the expression of six genes. , 2004, The New England journal of medicine.

[48]  Ji Zhu,et al.  Boosting as a Regularized Path to a Maximum Margin Classifier , 2004, J. Mach. Learn. Res..

[49]  B. Efron The Estimation of Prediction Error , 2004 .

[50]  P. Heagerty,et al.  Survival Model Predictive Accuracy and ROC Curves , 2005, Biometrics.

[51]  George C Tseng,et al.  Tight Clustering: A Resampling‐Based Approach for Identifying Stable and Tight Patterns in Data , 2005, Biometrics.

[52]  Jiang Gui,et al.  Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data , 2005, Bioinform..

[53]  John O'Quigley, Ronghui Xu  Explained Variation in Proportional Hazards Regression , 2005 .

[54]  H. Zou,et al.  Regularization and variable selection via the elastic net , 2005 .

[55]  Willem A Rensink,et al.  Statistical issues in microarray data analysis. , 2006, Methods in molecular biology.
