Paper information - On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data

MOTIVATION: Genome-wide association (GWA) studies have proven to be a successful approach for helping unravel the genetic basis of complex genetic diseases. However, the identified associations are not well suited for disease prediction, and only a modest portion of the heritability can be explained for most diseases, such as Type 2 diabetes or Crohn's disease. This may partly be due to the low power of standard statistical approaches to detect gene-gene and gene-environment interactions when small marginal effects are present. A promising alternative is Random Forests, which have already been successfully applied in candidate gene analyses. Important single nucleotide polymorphisms (SNPs) are detected by permutation importance measures. To this day, the application to GWA data was highly cumbersome with existing implementations because of the high computational burden.

RESULTS: Here, we present the new freely available software package Random Jungle (RJ), which facilitates the rapid analysis of GWA data. The program yields valid results and computes up to 159 times faster than the fastest alternative implementation, while still maintaining all options of other programs. Specifically, it offers the different permutation importance measures available. It includes new options such as the backward elimination method. We illustrate the application of RJ to a GWA of Crohn's disease. The most important SNPs validate recent findings in the literature and reveal potential interactions.

AVAILABILITY: The RJ software package is freely available at http://www.randomjungle.org
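
The permutation importance idea mentioned in the abstract can be illustrated with a minimal sketch. This is not Random Jungle's code or interface: it is a simplified, model-agnostic variant that measures the drop in accuracy on a fixed data set, whereas Random Forests typically compute the measure per tree on out-of-bag samples. The names `permutation_importance`, `toy_predict`, and the toy genotype data are hypothetical and exist only for illustration.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, breaking its link to the labels.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data (assumption): genotypes coded 0/1/2 at three hypothetical SNPs,
# where only SNP 0 determines the label.
data_rng = random.Random(42)
X = [[data_rng.randint(0, 2) for _ in range(3)] for _ in range(200)]
y = [1 if row[0] == 2 else 0 for row in X]

def toy_predict(row):
    # Hypothetical stand-in for a trained forest: it looks only at SNP 0.
    return 1 if row[0] == 2 else 0

scores = permutation_importance(toy_predict, X, y)
print(scores)
```

In this toy setting, shuffling SNP 0 degrades accuracy while shuffling the other two columns changes nothing, so only SNP 0 receives a positive score. A backward elimination scheme of the kind the abstract mentions would, under the same assumptions, repeatedly refit the model and discard the lowest-scoring variables.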

Andreas Ziegler | Daniel F. Schwarz | Inke R. König
