Gradient Boosting as a SNP Filter: an Evaluation Using Simulated and Hair Morphology Data

Genome-wide association studies (GWAS) typically regress the phenotype on each SNP separately under an additive genetic model. Although statistical models for recessive and dominant effects and for SNP-SNP and SNP-environment interactions exist, the multiple-testing burden makes an evaluation of all possible effects impractical for genome-wide data. We advocate a two-step approach in which the first step is a filter that is sensitive to different types of SNP main and interaction effects. The aim is to reduce the number of SNPs substantially, so that more specific modeling becomes feasible in the second step. We evaluate a statistical learning method called “gradient boosting machine” (GBM) for use as such a filter. GBM does not require an a priori specification of a genetic model and permits the inclusion of large numbers of covariates. It can therefore be used to explore multiple gene-environment (GxE) interactions, which would not be feasible within the parametric framework used in GWAS. We show in a simulation that GBM performs well even under conditions favorable to the standard additive regression model commonly used in GWAS, and that it is sensitive to interaction effects even if one of the interacting variables has a zero main effect. A variable of this kind would not be detected in a standard GWAS. Our evaluation is accompanied by an analysis of empirical data on hair morphology. We estimate the phenotypic variance explained by increasing numbers of the highest-ranked SNPs, and show that selecting 10K-20K SNPs is sufficient in the first step of a two-step approach.
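The filtering step can be illustrated with a small, self-contained sketch. It is not the implementation evaluated in the paper: it assumes squared-error loss, depth-1 trees (stumps), a toy simulation with one recessive causal SNP, and hypothetical names and settings throughout (`fit_stump`, `gbm_importance`, `CAUSAL`, 50 trees, shrinkage 0.1, a top-5 cutoff). Stumps can pick up non-additive main effects, but capturing interactions would require deeper trees.

```python
import random

def fit_stump(X, r):
    """Find the (SNP, threshold) split that most reduces the squared error of r."""
    n, p = len(X), len(X[0])
    total = sum(r)
    best = (0.0, None, None, 0.0, 0.0)  # gain, SNP index, threshold, left mean, right mean
    for j in range(p):
        for t in (0, 1):  # genotypes are coded 0/1/2: split {<= t} versus {> t}
            left = [r[i] for i in range(n) if X[i][j] <= t]
            n_left = len(left)
            if n_left == 0 or n_left == n:
                continue
            s_left = sum(left)
            s_right, n_right = total - s_left, n - n_left
            # Reduction in the residual sum of squares achieved by this split.
            gain = s_left ** 2 / n_left + s_right ** 2 / n_right - total ** 2 / n
            if gain > best[0]:
                best = (gain, j, t, s_left / n_left, s_right / n_right)
    return best

def gbm_importance(X, y, n_trees=50, shrinkage=0.1):
    """Boost stumps on residuals and accumulate each SNP's error reduction."""
    n, p = len(X), len(X[0])
    mean_y = sum(y) / n
    pred = [mean_y] * n
    importance = [0.0] * p
    for _ in range(n_trees):
        # Residuals are the negative gradient of the squared-error loss.
        resid = [y[i] - pred[i] for i in range(n)]
        gain, j, t, m_left, m_right = fit_stump(X, resid)
        if j is None:
            break
        importance[j] += gain
        for i in range(n):
            pred[i] += shrinkage * (m_left if X[i][j] <= t else m_right)
    return importance

# Toy simulation (assumed settings): 400 individuals, 20 SNPs, allele frequency 0.4.
rng = random.Random(1)
N, P, CAUSAL = 400, 20, 7
X = [[(rng.random() < 0.4) + (rng.random() < 0.4) for _ in range(P)] for _ in range(N)]
# Recessive effect: only carriers of two minor alleles at the causal SNP are shifted.
y = [2.0 * (row[CAUSAL] == 2) + rng.gauss(0.0, 1.0) for row in X]

importance = gbm_importance(X, y)
ranking = sorted(range(P), key=lambda j: importance[j], reverse=True)
keep = ranking[:5]  # step 1 of the two-step approach: retain only the top-ranked SNPs
print("SNPs kept for second-step modeling:", keep)
```

In a two-step analysis, only the SNPs in `keep` would be passed on to more specific models, for example tests of recessive, dominant, or interaction effects. On genome-wide data the cutoff would be far larger than in this toy example, and an optimized library would replace the pure-Python loops.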
