Performance of random forest when SNPs are in linkage disequilibrium

BackgroundSingle nucleotide polymorphisms (SNPs) may be correlated due to linkage disequilibrium (LD). Association studies look for both direct and indirect associations with disease loci. In a Random Forest (RF) analysis, correlation between a true risk SNP and SNPs in LD may lead to diminished variable importance for the true risk SNP. One approach to address this problem is to select SNPs in linkage equilibrium (LE) for analysis. Here, we explore alternative methods for dealing with SNPs in LD: change the tree-building algorithm by building each tree in an RF only with SNPs in LE, modify the importance measure (IM), and use haplotypes instead of SNPs to build a RF.ResultsWe evaluated the performance of our alternative methods by simulation of a spectrum of complex genetics models. When a haplotype rather than an individual SNP is the risk factor, we find that the original Random Forest method performed on SNPs provides good performance. When individual, genotyped SNPs are the risk factors, we find that the stronger the genetic effect, the stronger the effect LD has on the performance of the original RF. A revised importance measure used with the original RF is relatively robust to LD among SNPs; this revised importance measure used with the revised RF is sometimes inflated. Overall, we find that the revised importance measure used with the original RF is the best choice when the genetic model and the number of SNPs in LD with risk SNPs are unknown. For the haplotype-based method, under a multiplicative heterogeneity model, we observed a decrease in the performance of RF with increasing LD among the SNPs in the haplotype.ConclusionOur results suggest that by strategically revising the Random Forest method tree-building or importance measure calculation, power can increase when LD exists between SNPs. We conclude that the revised Random Forest method performed on SNPs offers an advantage of not requiring genotype phase, making it a viable tool for use in the context of thousands of SNPs, such as candidate gene studies and follow-up of top candidates from genome wide association studies.

[1]  I. König,et al.  Picking single-nucleotide polymorphisms in forests , 2007, BMC proceedings.

[2]  Winnie S. Liang,et al.  GAB2 Alleles Modify Alzheimer's Risk in APOE ɛ4 Carriers , 2007, Neuron.

[3]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[4]  Ramón Díaz-Uriarte,et al.  Gene selection and classification of microarray data using random forest , 2006, BMC Bioinformatics.

[5]  J. Friedman Special Invited Paper-Additive logistic regression: A statistical view of boosting , 2000 .

[6]  Winnie S. Liang,et al.  GAB2 alleles modify Alzheimer's risk in APOE epsilon4 carriers. , 2007, Neuron.

[7]  Kellie J. Archer,et al.  Empirical characterization of random forest variable importance measures , 2008, Comput. Stat. Data Anal..

[8]  Qiong Yang,et al.  Two-stage approach for identifying single-nucleotide polymorphisms associated with rheumatoid arthritis using random forests and Bayesian networks , 2007, BMC proceedings.

[9]  Achim Zeileis,et al.  Conditional variable importance for random forests , 2008, BMC Bioinformatics.

[10]  N. Risch Linkage strategies for genetically complex traits. II. The power of affected relative pairs. , 1990, American journal of human genetics.

[11]  K. Lunetta,et al.  Screening large-scale association study data: exploiting interactions using random forests , 2004, BMC Genetics.

[12]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[13]  K. Lunetta,et al.  Identifying SNPs predictive of phenotype using random forests , 2005, Genetic epidemiology.

[14]  D. Schaid,et al.  Score tests for association between traits and haplotypes when linkage phase is ambiguous. , 2002, American journal of human genetics.

[15]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[16]  Manuel A. R. Ferreira,et al.  PLINK: a tool set for whole-genome association and population-based linkage analyses. , 2007, American journal of human genetics.