Feature Selection using a Random Forests Classifier for the Integrated Analysis of Multiple Data Types

Complex clinical phenotypes arise from the concerted interactions among the myriad components of a biological system. Therefore, comprehensive models can only be developed through the integrated study of multiple types of experimental data gathered from the system in question. The Random Forests™ (RF) method is adept at identifying relevant features having only slight main effects in high-dimensional data. This method is well-suited to integrated analysis, as relevant attributes may be selected from categorical or continuous data, and there may be interactions across data types. RF is a natural approach for studying gene-gene, gene-protein, or protein-protein interactions because importance scores for particular attributes take interactions into account. Thus, RF is a promising solution to the analysis challenge posed by high-dimensional datasets that include interactions among attributes of different types. In this study, we characterize the performance of RF on a range of simulated genetic and/or proteomic datasets. We compare the performance of RF in identifying relevant attributes when given genetic data alone, proteomic data alone, or a combined dataset of genetic plus proteomic data. Our results indicate that utilizing multiple data types is beneficial when the disease model is complex and the data type associated with the phenotypic outcome is unknown. The results of this study also show that RF is adept at identifying relevant features in high-dimensional data with small main effects and low heritability.
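
The comparison described above (genetic data alone, proteomic data alone, and the combined dataset) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the procedure used in the study: the simulated disease model (one SNP and one protein jointly determining case status, with 5% label noise), the helper names `simulate`, `best_split`, `grow`, and `forest_importance`, the tree depth, and the simplified Gini-based importance score are all hypothetical choices made here for the example.

```python
import random

def simulate(n=150, n_snps=4, n_prots=4, seed=1):
    """Toy data: SNP genotypes (0/1/2) plus continuous protein levels.
    Assumed model: case if snp0 carries a minor allele AND prot0 is elevated."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        snps = [rng.randint(0, 1) + rng.randint(0, 1) for _ in range(n_snps)]
        prots = [rng.gauss(0.0, 1.0) for _ in range(n_prots)]
        y = 1 if (snps[0] >= 1 and prots[0] > 0.0) else 0
        if rng.random() < 0.05:          # 5% label noise
            y = 1 - y
        rows.append(snps + prots)
        labels.append(y)
    return rows, labels

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(rows, labels, features):
    """Best (gain, feature, threshold) among the candidate features, or None."""
    parent, n, best = gini(labels), len(labels), None
    for f in features:
        vals = sorted({r[f] for r in rows})[:-1]      # keeps both sides non-empty
        for t in vals[::max(1, len(vals) // 10)]:      # thin out thresholds
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best

def grow(rows, labels, depth, mtry, rng, importance, n_total):
    """Grow one tree recursively, adding weighted impurity decreases to importance."""
    if depth == 0 or len(set(labels)) < 2:
        return
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), mtry))
    if split is None or split[0] <= 0.0:
        return
    gain, f, t = split
    importance[f] += gain * len(labels) / n_total
    for keep_left in (True, False):
        idx = [i for i, r in enumerate(rows) if (r[f] <= t) == keep_left]
        grow([rows[i] for i in idx], [labels[i] for i in idx],
             depth - 1, mtry, rng, importance, n_total)

def forest_importance(rows, labels, n_trees=50, depth=3, seed=0):
    """Mean decrease in Gini impurity per feature over bootstrapped trees."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    mtry = max(1, int(p ** 0.5))         # features tried at each node
    importance = [0.0] * p
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        grow([rows[i] for i in boot], [labels[i] for i in boot],
             depth, mtry, rng, importance, n)
    return [v / n_trees for v in importance]

rows, labels = simulate()
imp_genetic = forest_importance([r[:4] for r in rows], labels)
imp_proteomic = forest_importance([r[4:] for r in rows], labels)
imp_combined = forest_importance(rows, labels)

names = [f"snp{i}" for i in range(4)] + [f"prot{i}" for i in range(4)]
ranked = sorted(zip(names, imp_combined), key=lambda kv: -kv[1])
for name, score in ranked:
    print(f"{name:6s} {score:.4f}")
```

In this toy setting, only the combined run can rank both the causal SNP and the causal protein, since each single-type dataset contains just one of the two. The strong main effects assumed here keep the example small; the weak-main-effect, low-heritability scenarios discussed above would need larger forests and more careful simulation.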
