On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data

MOTIVATION: Genome-wide association (GWA) studies have proven to be a successful approach for helping unravel the genetic basis of complex genetic diseases. However, the identified associations are not well suited for disease prediction, and only a modest portion of the heritability can be explained for most diseases, such as Type 2 diabetes or Crohn's disease. This may partly be due to the low power of standard statistical approaches to detect gene-gene and gene-environment interactions when only small marginal effects are present. A promising alternative is Random Forests, which have already been successfully applied in candidate gene analyses. Important single nucleotide polymorphisms (SNPs) are detected by permutation importance measures. Until now, applying Random Forests to GWA data has been highly cumbersome with existing implementations because of the high computational burden.

RESULTS: Here, we present the new, freely available software package Random Jungle (RJ), which facilitates the rapid analysis of GWA data. The program yields valid results and computes up to 159 times faster than the fastest alternative implementation, while still maintaining all options of other programs. Specifically, it offers the different permutation importance measures available and adds new options such as the backward elimination method. We illustrate the application of RJ to a GWA study of Crohn's disease. The most important SNPs validate recent findings in the literature and reveal potential interactions.

AVAILABILITY: The RJ software package is freely available at http://www.randomjungle.org
