High‐Dimensional Variable Selection in Meta‐Analysis for Censored Data

This article considers the problem of selecting predictors of time to an event from a high-dimensional set of candidate predictors using data from multiple studies. As an alternative to current multistage testing approaches, we propose to model the study-to-study heterogeneity explicitly with a hierarchical model that borrows strength across studies. Our method incorporates censored data through an accelerated failure time model. Using a carefully formulated prior specification, we develop a fast approach to predictor selection and shrinkage estimation for high-dimensional predictors. For model fitting, we develop a Monte Carlo expectation maximization (MC-EM) algorithm to accommodate censored data. The proposed approach, which is related to the relevance vector machine (RVM), relies on maximum a posteriori estimation to rapidly obtain a sparse estimate. As with the typical RVM, there is an intrinsic thresholding property in which unimportant predictors tend to have their coefficients shrunk to zero. We compare our method with several commonly used procedures through simulation studies. We also illustrate the method using gene expression barcode data from three breast cancer studies.
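To make the role of the Monte Carlo E-step concrete, the following is a minimal sketch of how censored event times might be imputed under an accelerated failure time model. It assumes normal errors on the log scale (a log-normal AFT model), which the abstract does not specify, and the names `draw_censored_log_time` and `mc_e_step` are hypothetical. It is an illustration of the general idea, not the authors' algorithm, and it averages only the imputed log-times, whereas a full MC-EM update would also need second moments for the scale parameter.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for cdf and inverse cdf


def draw_censored_log_time(mu, sigma, c, rng):
    """Draw log T from N(mu, sigma^2) truncated to (c, inf) via inverse-CDF sampling.

    Assumption: normal errors on the log scale (log-normal AFT model).
    """
    lower = _STD_NORMAL.cdf((c - mu) / sigma)
    u = lower + (1.0 - lower) * rng.random()
    # Keep u strictly inside (0, 1) so inv_cdf is defined.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return max(c, mu + sigma * _STD_NORMAL.inv_cdf(u))


def mc_e_step(log_times, observed, mu, sigma, n_draws, rng):
    """Replace each censored log-time by the Monte Carlo mean of truncated draws.

    log_times: observed log event time, or log censoring time if censored
    observed:  True if the event was observed, False if right-censored
    mu:        current linear predictor x_i' beta for each subject
    """
    completed = []
    for y, is_event, m in zip(log_times, observed, mu):
        if is_event:
            completed.append(y)  # observed times are kept as they are
        else:
            draws = [draw_censored_log_time(m, sigma, y, rng) for _ in range(n_draws)]
            completed.append(sum(draws) / n_draws)
    return completed


rng = random.Random(0)
log_times = [1.2, 0.5, 2.0]
observed = [True, False, False]
mu = [1.0, 1.0, 1.5]
completed = mc_e_step(log_times, observed, mu, sigma=0.8, n_draws=500, rng=rng)
```

In an MC-EM iteration, the completed log-times would then feed an M-step that updates the regression coefficients under the chosen prior; that step depends on the prior specification and is not sketched here.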
