Variable Selection with False Discovery Control

Kevin He, Yanming Li, Ji Zhu, Hongliang Liu, Jeffrey E. Lee, Christopher I. Amos, Terry Hyslop, Jiashun Jin, Qinyi Wei and Yi Li

Department of Biostatistics, University of Michigan, Ann Arbor, Michigan 48109, USA.
Department of Statistics, University of Michigan, Ann Arbor, Michigan 48109, USA.
Department of Medicine, Duke University School of Medicine and Duke Cancer Institute, Duke University Medical Center, Durham, North Carolina 27710, USA.
Department of Surgical Oncology, The University of Texas M.D. Anderson Cancer Center, Houston, TX 77030, USA.
Department of Community and Family Medicine, Geisel School of Medicine, Dartmouth College, Hanover, NH 03750, USA.
Department of Biostatistics and Bioinformatics, Duke University; and Duke Clinical Research Institute, Durham, North Carolina 27710, USA.
Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213, USA.

Technological advances that allow routine identification of high-dimensional risk factors have created high demand for statistical techniques that make full use of these rich sources of information in genome-wide association studies (GWAS). Variable selection for censored outcome data and control of false discoveries (i.e., inclusion of irrelevant variables) both present serious challenges in the presence of high-dimensional predictors. In the context of survival analysis with high-dimensional covariates, this paper develops a computationally feasible method for building general risk prediction models while controlling false discoveries. The method performs high-dimensional variable selection and incorporates stability selection to control false discoveries. Comparisons with the commonly used univariate and Lasso approaches for variable selection show that the proposed method yields fewer false discoveries. We apply the proposed method to study the associations of 2,339 common single-nucleotide polymorphisms (SNPs) with overall survival among cutaneous melanoma (CM) patients. The results confirm that BRCA2 pathway SNPs are likely to be associated with overall survival, as reported in the previous literature. Moreover, we identify several new Fanconi anemia (FA) pathway SNPs that are likely to modulate survival of CM patients.
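To make the role of stability selection concrete, the following is a minimal sketch of the generic idea, not the authors' procedure. It assumes a continuous, uncensored response and swaps in a deliberately simple base selector (the `q` columns most correlated with the response) in place of a Cox-model selector. The function names `top_q_by_correlation` and `stability_selection`, the simulated data, and all parameter values are hypothetical. The returned bound is the classical q^2 / ((2 * threshold - 1) * p) bound on the expected number of false selections from the stability selection literature, which holds only under that literature's assumptions and may differ from the error control used in the paper.

```python
import numpy as np


def top_q_by_correlation(X, y, q):
    """Hypothetical base selector: indices of the q columns of X with the
    largest absolute Pearson correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    corr = np.abs(Xc.T @ yc) / np.where(denom > 0, denom, 1.0)
    return np.argsort(corr)[-q:]


def stability_selection(X, y, q, n_subsamples=100, threshold=0.8, seed=0):
    """Run the base selector on random half-samples and keep the variables
    whose selection frequency reaches `threshold` (must exceed 0.5)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)  # half-sample without replacement
        counts[top_q_by_correlation(X[idx], y[idx], q)] += 1
    freq = counts / n_subsamples
    selected = np.flatnonzero(freq >= threshold)
    # Classical bound on the expected number of false selections.
    bound = q ** 2 / ((2 * threshold - 1) * p)
    return selected, freq, bound


# Simulated example (assumption): 3 true signals among 50 predictors.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50))
y = 3 * X[:, 0] + 3 * X[:, 1] + 3 * X[:, 2] + rng.standard_normal(200)
selected, freq, bound = stability_selection(X, y, q=5)
```

Raising `threshold` or lowering `q` tightens the bound at the cost of power. In a survival setting the base selector would be replaced by one suited to censored outcomes.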
