Use of Wrapper Algorithms Coupled with a Random Forests Classifier for Variable Selection in Large-Scale Genomic Association Studies

Modern large-scale genetic association studies generate increasingly high-dimensional datasets. Therefore, some variable selection procedure should be performed before the application of traditional data analysis methods, for reasons of both computational efficiency and problems related to overfitting. We describe here a "wrapper" strategy (SIZEFIT) for variable selection that uses a Random Forests classifier, coupled with various local search/optimization algorithms. We apply it to a large dataset consisting of 2,425 African-American and non-Hispanic white individuals genotyped for 4,869 single-nucleotide polymorphisms (SNPs) in a coronary heart disease (CHD) case-cohort association study (Atherosclerosis Risk in Communities), using incident CHD and plasma low-density lipoprotein (LDL) cholesterol levels as the dependent variables. We show that most SNPs can be safely removed from the dataset without compromising the predictive (classification) accuracy, with only a small number of SNPs (sometimes less than 100) containing any predictive signal. A statistical (SUMSTAT) approach is also applied to the dataset for comparison purposes. We describe a novel method for refining the subset of signal-containing SNPs (FIXFIT), based on an Extremal Optimization algorithm. Finally, we compare the top SNP rankings obtained by different methods and devise practical guidelines for researchers trying to generate a compact subset of predictive SNPs from genome-wide association datasets. Interestingly, there is a significant amount of overlap between seemingly very heterogeneous rankings. We conclude by constructing compact optimal predictive SNP subsets for CHD (less than 150 SNPs) and LDL (less than 300 SNPs) phenotypes, and by comparing various rankings for two well-known positive control SNPs for LDL in the apolipoprotein E gene.

[1]  Joanna L. Kelley,et al.  Positive selection in the human genome: from genome scans to biological significance. , 2008, Annual review of genomics and human genetics.

[2]  Hans C van Houwelingen,et al.  Analysis of multiple SNPs in genetic association studies: comparison of three multi‐locus methods to prioritize and select SNPs , 2007, Genetic epidemiology.

[3]  A. Folsom,et al.  The Atherosclerosis Risk in Communities (ARIC) Study: design and objectives. The ARIC investigators. , 1989, American journal of epidemiology.

[4]  Gregory Piatetsky-Shapiro,et al.  Microarray data mining: facing the challenges , 2003, SKDD.

[5]  Ting Wang,et al.  Application of Breiman's Random Forest to Modeling Structure-Activity Relationships of Pharmaceutical Molecules , 2004, Multiple Classifier Systems.

[6]  A. Percus,et al.  Nature's Way of Optimizing , 1999, Artif. Intell..

[7]  K. Lunetta,et al.  Identifying SNPs predictive of phenotype using random forests , 2005, Genetic epidemiology.

[8]  J. Ott,et al.  Complement Factor H Polymorphism in Age-Related Macular Degeneration , 2005, Science.

[9]  John H. Holland,et al.  Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence , 1992 .

[10]  Geoffrey J McLachlan,et al.  Selection bias in gene extraction on the basis of microarray gene-expression data , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[11]  Yoav Freund,et al.  A Short Introduction to Boosting , 1999 .

[12]  Andrew G. Clark,et al.  Mining Genetic Epidemiology Data with Bayesian Networks Application to APOE Gene Variation and Plasma Lipid Levels , 2005, J. Comput. Biol..

[13]  J. Ott,et al.  Mathematical multi-locus approaches to localizing complex human trait genes , 2003, Nature Reviews Genetics.

[14]  Juha Reunanen,et al.  Overfitting in Making Comparisons Between Variable Selection Methods , 2003, J. Mach. Learn. Res..

[15]  Robert P. Sheridan,et al.  Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling , 2003, J. Chem. Inf. Comput. Sci..

[16]  M. Province,et al.  19 Classification methods for confronting heterogeneity , 2001 .

[17]  Ron Kohavi,et al.  Wrappers for Feature Subset Selection , 1997, Artif. Intell..

[18]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[19]  K. Lunetta,et al.  Screening large-scale association study data: exploiting interactions using random forests , 2004, BMC Genetics.

[20]  George Forman,et al.  An Extensive Empirical Study of Feature Selection Metrics for Text Classification , 2003, J. Mach. Learn. Res..

[21]  J. Ott,et al.  Trimming, weighting, and grouping SNPs in human case-control association studies. , 2001, Genome research.

[22]  Ron Kohavi,et al.  Wrappers for feature selection , 1997 .

[23]  C. Sing,et al.  A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. , 2001, Genome research.

[24]  H. Dalton The Measurement of the Inequality of Incomes , 1920 .

[25]  Robert Tibshirani,et al.  An Introduction to the Bootstrap , 1994 .

[26]  T J White,et al.  Positive selection in the human genome inferred from human-chimp-mouse orthologous gene alignments. , 2003, Cold Spring Harbor symposia on quantitative biology.

[27]  Ramón Díaz-Uriarte,et al.  Gene selection and classification of microarray data using random forest , 2006, BMC Bioinformatics.

[28]  A. G. Heidema,et al.  The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases , 2006, BMC Genetics.

[29]  Yoram Singer,et al.  A simple, fast, and effective rule learner , 1999, AAAI 1999.

[30]  C. Gini Measurement of Inequality of Incomes , 1921 .

[31]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[32]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[33]  Edward R. Dougherty,et al.  Is cross-validation better than resubstitution for ranking genes? , 2004, Bioinform..

[34]  Jason H. Moore,et al.  Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions , 2003, Bioinform..