An integrated approach to reduce the impact of minor allele frequency and linkage disequilibrium on variable importance measures for genome-wide data

MOTIVATION: There is growing momentum to develop statistical learning (SL) methods as an alternative to conventional genome-wide association studies (GWAS). Methods such as random forests (RF) and gradient boosting machines (GBM) produce variable importance measures that indicate how well each single-nucleotide polymorphism (SNP) predicts the phenotype. For RF, it has been shown that variable importance measures are systematically affected by minor allele frequency (MAF) and linkage disequilibrium (LD). To establish RF and GBM as viable alternatives for analyzing genome-wide data, it is necessary to address this potential bias and to show that SL methods do not significantly under-perform conventional GWAS methods.

RESULTS: Both LD and MAF have a significant impact on the variable importance measures commonly used in RF and GBM. Dividing SNPs into overlapping subsets with approximate linkage equilibrium and applying SL methods to each subset successfully reduces the impact of LD. A welcome side effect of this approach is a dramatic reduction in parallel computing time, which makes it more feasible to apply SL methods to large datasets. The created subsets also facilitate a potential correction for the effect of MAF using pseudocovariates. Simulations in which simulated SNPs are embedded in empirical data, covering a range of effect sizes, minor allele frequencies and LD patterns, suggest that subsetting often improves the sensitivity to detect effects and that the approach does not significantly under-perform the Armitage trend test, even under conditions that are ideal for the trend test.

AVAILABILITY: Code for the LD subsetting algorithm and pseudocovariate correction is available at http://www.nd.edu/~glubke/code.html.
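
To make the idea of subsets in approximate linkage equilibrium concrete, here is a minimal sketch in Python. It is not the authors' algorithm (their code is at the URL above). It is a simple greedy partition that assumes genotypes coded 0/1/2, uses squared Pearson correlation as the LD measure, and produces non-overlapping subsets, whereas the abstract describes overlapping ones. The function names, the threshold `max_r2` and the toy data are all hypothetical.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (coded 0/1/2)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as no LD
    return sxy * sxy / (sxx * syy)


def ld_subsets(genotypes, max_r2=0.1):
    """Greedily partition SNPs so that no pair within a subset exceeds max_r2.

    genotypes: dict mapping SNP name -> list of genotype codes, one per subject.
    max_r2: assumed LD threshold; the value 0.1 is illustrative only.
    """
    subsets = []
    for snp, geno in genotypes.items():
        for subset in subsets:
            # Join the first subset in which this SNP is in low LD with every member.
            if all(r_squared(geno, genotypes[other]) <= max_r2 for other in subset):
                subset.append(snp)
                break
        else:
            subsets.append([snp])  # no compatible subset found: start a new one
    return subsets


# Toy data: snp_b is in high LD with snp_a; snp_c is nearly independent of both.
toy = {
    "snp_a": [0, 1, 2, 0, 1, 2],
    "snp_b": [0, 1, 2, 0, 1, 1],
    "snp_c": [1, 0, 1, 1, 0, 1],
}
print(ld_subsets(toy))  # [['snp_a', 'snp_c'], ['snp_b']]
```

Each resulting subset could then be passed to an RF or GBM fit separately, which is also what makes the per-subset analyses easy to run in parallel.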
