Screening large-scale association study data: exploiting interactions using random forests

Background

Genome-wide association studies for complex diseases will produce genotypes on hundreds of thousands of single nucleotide polymorphisms (SNPs). A logical first approach to dealing with massive numbers of SNPs is to screen them with a statistical test, retaining only those that meet a chosen criterion for further study. For example, SNPs can be ranked by p-value, and those with the lowest p-values retained. When SNPs have large interaction effects but small marginal effects in a population, they are unlikely to be retained when univariate tests are used for screening. However, model-based screens that pre-specify interactions are impractical for data sets with thousands of SNPs. Random forest analysis is an alternative method that produces a single measure of importance for each predictor variable, one that takes into account interactions among variables without requiring model specification. Interactions increase the importance of the individual interacting variables, making them more likely to be given high importance relative to other variables. We test the performance of random forests as a screening procedure to identify small numbers of risk-associated SNPs from among large numbers of unassociated SNPs, using complex disease models with up to 32 loci that incorporate both genetic heterogeneity and multi-locus interaction.

Results

Keeping other factors constant, if risk SNPs interact, the random forest importance measure significantly outperforms the Fisher exact test as a screening tool. As the number of interacting SNPs increases, the improvement in performance of random forest analysis relative to the Fisher exact test for screening also increases. Random forests perform similarly to the univariate Fisher exact test as a screening tool when SNPs in the analysis do not interact.

Conclusions

In the context of large-scale genetic association studies where unknown interactions exist among true risk-associated SNPs, or between SNPs and environmental covariates, screening SNPs using random forest analyses can significantly reduce the number of SNPs that need to be retained for further study compared to standard univariate screening methods.
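
To make concrete why a univariate screen can miss interacting SNPs, the following is a minimal sketch rather than the simulation used in the study. It assumes a hypothetical toy cohort in which disease occurs exactly when an individual carries one, but not both, of two risk alleles, so each SNP has no marginal effect. The function names `fisher_exact_two_sided` and `table_for`, the cohort size, and the fully deterministic disease model are all illustrative assumptions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


def table_for(exposed, disease):
    """Return (exposed cases, exposed controls, unexposed cases, unexposed controls)."""
    a = sum(1 for e, y in zip(exposed, disease) if e and y)
    b = sum(1 for e, y in zip(exposed, disease) if e and not y)
    c = sum(1 for e, y in zip(exposed, disease) if not e and y)
    d = sum(1 for e, y in zip(exposed, disease) if not e and not y)
    return a, b, c, d


# Hypothetical toy cohort: 25 individuals per carrier combination, with
# disease present exactly when one, but not both, risk alleles are carried.
snp_a, snp_b, disease = [], [], []
for carries_a in (0, 1):
    for carries_b in (0, 1):
        for _ in range(25):
            snp_a.append(carries_a)
            snp_b.append(carries_b)
            disease.append(carries_a ^ carries_b)

# Univariate screen: each SNP tested on its own.
p_a = fisher_exact_two_sided(*table_for(snp_a, disease))
p_b = fisher_exact_two_sided(*table_for(snp_b, disease))

# Joint test: requires knowing in advance which pair interacts.
joint = [x ^ y for x, y in zip(snp_a, snp_b)]
p_joint = fisher_exact_two_sided(*table_for(joint, disease))

print(f"SNP A alone: p = {p_a:.3g}")
print(f"SNP B alone: p = {p_b:.3g}")
print(f"A and B jointly: p = {p_joint:.3g}")
```

In this toy setting both single-SNP tests show no association at all, while the joint genotype is strongly associated with disease. A p-value ranking would therefore discard both SNPs, and testing every pair explicitly does not scale to hundreds of thousands of SNPs. That gap is what a screen needs to close if it is to credit variables for their interactions without a pre-specified model.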
