Bioinformatics Original Paper Bayesian Variable Selection for the Analysis of Microarray Data with Censored Outcomes

MOTIVATION: A common task in microarray data analysis is identifying genes associated with a phenotype. When the outcomes of interest are censored time-to-event data, standard approaches assess the effect of each gene by fitting univariate survival models. In this paper, we propose a Bayesian variable selection approach that identifies relevant markers by jointly assessing sets of genes. We consider accelerated failure time (AFT) models with log-normal and log-t distributional assumptions. A data augmentation approach is used to impute the failure times of censored observations, and mixture priors are placed on the regression coefficients to identify promising subsets of variables. The proposed method provides a unified procedure for the selection of relevant genes and the prediction of survivor functions.

RESULTS: We demonstrate the performance of the method on simulated examples and on several microarray datasets. For the simulation study, we consider scenarios with a large number of noisy variables and different degrees of correlation between the relevant and non-relevant (noisy) variables. We are able to identify the correct covariates and obtain good predictions of the survivor functions. For the microarray applications, some of our selected genes are known to be related to the diseases under study and a few are in agreement with findings from other researchers.

AVAILABILITY: The Matlab code for implementing the Bayesian variable selection method may be obtained from the corresponding author.

CONTACT: mvannucci@stat.tamu.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
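The data augmentation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the log-normal AFT case, in which the log failure time of subject i is normal with mean equal to the linear predictor and standard deviation sigma, so a censored log time can be imputed by drawing from that normal distribution truncated below at the observed log censoring time. This is not the authors' Matlab code: the function name `impute_censored_log_times`, its arguments, and the toy numbers are hypothetical and chosen only for illustration.

```python
import math
import random
from statistics import NormalDist


def impute_censored_log_times(log_times, censored, means, sigma, rng):
    """Return one imputation draw of the log failure times.

    Assumption (not from the paper's code): log T_i ~ Normal(means[i], sigma).
    Uncensored entries are returned unchanged; censored entries are replaced
    by a draw from the normal truncated below at the log censoring time.
    """
    imputed = []
    for w, is_censored, mu in zip(log_times, censored, means):
        if not is_censored:
            imputed.append(w)
            continue
        dist = NormalDist(mu, sigma)
        lower = dist.cdf(w)  # probability mass below the censoring time
        # Inverse-CDF sampling: u is uniform on [lower, 1).
        u = lower + (1.0 - lower) * rng.random()
        # Keep u strictly inside (0, 1) so inv_cdf is defined.
        u = min(max(u, 1e-12), 1.0 - 1e-12)
        # Guard against rounding pushing the draw below the censoring time.
        imputed.append(max(dist.inv_cdf(u), w))
    return imputed


# Toy example: three subjects, the last two right-censored.
rng = random.Random(0)
log_t = [math.log(5.0), math.log(12.0), math.log(8.0)]
cens = [False, True, True]
means = [1.5, 2.0, 2.5]  # hypothetical linear predictors
draw = impute_censored_log_times(log_t, cens, means, 0.8, rng)
```

In a full sampler, a step like this would alternate with updates of the regression coefficients, the variable inclusion indicators, and sigma, with `means` recomputed from the current coefficients at each iteration. The log-t case would need a different truncated distribution and is not covered by this sketch.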
