Component-wise gradient boosting and false discovery control in survival analysis with high-dimensional covariates

MOTIVATION: Technological advances that allow routine identification of high-dimensional risk factors have created high demand for statistical techniques that make full use of these rich sources of information in genetic studies. With high-dimensional predictors, both variable selection for censored outcome data and control of false discoveries (i.e. inclusion of irrelevant variables) present serious challenges. This article develops a computationally feasible method based on boosting and stability selection. Specifically, we modify component-wise gradient boosting to improve its computational feasibility and introduce random permutation into stability selection to control false discoveries.

RESULTS: We propose a high-dimensional variable selection method that incorporates stability selection to control false discoveries. Comparisons with the commonly used univariate and Lasso approaches to variable selection show that the proposed method yields fewer false discoveries. We apply the method to study the associations of 2339 common single-nucleotide polymorphisms (SNPs) with overall survival among cutaneous melanoma (CM) patients. The results confirm that BRCA2 pathway SNPs are likely to be associated with overall survival, as previously reported. Moreover, we identify several new Fanconi anemia (FA) pathway SNPs that are likely to modulate survival of CM patients.

AVAILABILITY AND IMPLEMENTATION: The related source code and documents are freely available at https://sites.google.com/site/bestumich/issues.

CONTACT: yili@umich.edu.
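As a rough illustration of the two ingredients named in the abstract, the following Python sketch combines plain component-wise gradient boosting for the Cox partial likelihood with subsampling-based stability selection and a permuted-outcome reference run. It is not the authors' implementation: the abstract does not describe their computational modification or their exact permutation scheme, so the function names (`cox_negative_gradient`, `componentwise_boost`, `selection_frequency`, `permutation_null`), the half-sample subsampling, the threshold of 0.8 and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def cox_negative_gradient(eta, time, status):
    """Gradient of the Breslow log partial likelihood w.r.t. the linear predictor."""
    w = np.exp(eta - eta.max())                 # rescaling cancels in the ratio below
    at_risk = time[None, :] >= time[:, None]    # at_risk[j, k]: subject k at risk at time[j]
    risk_sum = at_risk @ w                      # sum of w over each risk set
    cum_hazard = at_risk.T @ (status / risk_sum)
    return status - w * cum_hazard

def componentwise_boost(X, time, status, n_steps=100, nu=0.1):
    """Plain component-wise boosting: update one coefficient per step."""
    Xc = X - X.mean(axis=0)
    ss = (Xc ** 2).sum(axis=0)
    ss[ss == 0] = np.inf                        # constant columns can never be chosen
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        u = cox_negative_gradient(Xc @ beta, time, status)
        xu = Xc.T @ u
        j = np.argmax(xu ** 2 / ss)             # covariate with the best least-squares fit to u
        beta[j] += nu * xu[j] / ss[j]           # shrunken least-squares slope
    return beta

def selection_frequency(X, time, status, n_subsamples=50, rng=None, **boost_kwargs):
    """Fraction of random half-samples in which each covariate enters the model."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        counts += componentwise_boost(X[idx], time[idx], status[idx], **boost_kwargs) != 0
    return counts / n_subsamples

def permutation_null(X, time, status, rng=None, **kwargs):
    """Selection frequencies after permuting outcomes, which breaks any association."""
    rng = np.random.default_rng(rng)
    perm = rng.permutation(len(time))
    return selection_frequency(X, time[perm], status[perm], rng=rng, **kwargs)

# Simulated data (hypothetical): 3 informative covariates out of 30.
rng = np.random.default_rng(0)
n, p = 120, 30
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[:3] = [1.0, -1.0, 0.8]
event_time = rng.exponential(1.0 / np.exp(X @ true_beta))
censor_time = rng.exponential(2.0, size=n)
time = np.minimum(event_time, censor_time)
status = (event_time <= censor_time).astype(float)

freq = selection_frequency(X, time, status, n_subsamples=20, rng=1, n_steps=30)
null_freq = permutation_null(X, time, status, n_subsamples=20, rng=2, n_steps=30)

threshold = 0.8                                 # assumed cut-off, not taken from the paper
selected = np.flatnonzero(freq >= threshold)
null_exceed = int((null_freq >= threshold).sum())
print("selected covariates:", selected)
print("permuted-outcome covariates above threshold:", null_exceed)
```

The count of permuted-outcome covariates that clear the threshold gives a crude indication of how many selections could be expected by chance alone; in practice the threshold would be tuned so that this count stays small relative to the number of selected covariates.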
