Evaluation of Penalized and Nonpenalized Methods for Disease Prediction with Large-Scale Genetic Data

Owing to recent improvements in genotyping technology, large-scale genetic data can be used to identify disease susceptibility loci, and these findings have substantially improved our understanding of complex diseases. Despite these successes, most of the genetic effects identified for many complex diseases are very small, which has been a major hurdle in building disease prediction models. Recently, many statistical methods based on penalized regression have been proposed to tackle the so-called “large P and small N” problem. Penalized regressions, including the least absolute shrinkage and selection operator (LASSO) and ridge regression, limit the parameter space, and this constraint enables the estimation of effects for a very large number of SNPs. Various extensions have been suggested, and in this report we compare their accuracy by applying them to several complex diseases. Our results show that penalized regressions are usually robust and provide better accuracy than the existing methods, at least for the diseases under consideration.
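
To illustrate how a penalty makes estimation feasible when the number of SNPs exceeds the number of samples, the following is a minimal sketch, not the procedure used in this report. It assumes simulated genotypes coded 0/1/2, a quantitative phenotype with squared-error loss (the report itself concerns disease status), and arbitrary penalty values; the function names `ridge_fit`, `lasso_fit`, and `soft_threshold` are hypothetical and written only for this example.

```python
import numpy as np

def soft_threshold(z, t):
    # Shrink z toward zero by t; values within [-t, t] become exactly zero.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ridge_fit(X, y, lam):
    # Minimizes ||y - X b||^2 + lam * ||b||^2.
    # Uses the identity b = X^T (X X^T + lam I)^{-1} y, which only needs
    # an n x n solve and is therefore convenient when p >> n.
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

def lasso_fit(X, y, lam, n_iter=100):
    # Minimizes (1 / 2n) * ||y - X b||^2 + lam * ||b||_1
    # by cyclic coordinate descent with soft-thresholding updates.
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # monomorphic SNP carries no information
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            if new != beta[j]:
                resid -= X[:, j] * (new - beta[j])
                beta[j] = new
    return beta

# Simulated data (assumption): 100 samples, 500 SNPs, 5 causal SNPs.
rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
X -= X.mean(axis=0)  # center genotypes so no intercept is needed
beta_true = np.zeros(p)
beta_true[:5] = 0.5
y = X @ beta_true + rng.normal(size=n)
y -= y.mean()

lam_ridge, lam_lasso = 10.0, 0.1  # arbitrary illustrative penalties
beta_ridge = ridge_fit(X, y, lam_ridge)
beta_lasso = lasso_fit(X, y, lam_lasso)

print("nonzero ridge coefficients:", np.count_nonzero(beta_ridge))
print("nonzero LASSO coefficients:", np.count_nonzero(beta_lasso))
```

Without a penalty, the least-squares problem has no unique solution here because there are more SNPs than samples. Ridge shrinks all coefficients but keeps them nonzero, while the LASSO penalty sets many of them exactly to zero. In practice the penalty values would be chosen by cross-validation rather than fixed in advance.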
