Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations

The use of random forests is increasingly common in genetic association studies. The variable importance measure (VIM) that is automatically calculated as a by-product of the algorithm is often used to rank polymorphisms with respect to their ability to predict the investigated phenotype. Here, we investigate a characteristic of this methodology that may be considered an important pitfall, namely that common variants are systematically favoured by the widely used Gini VIM. As a consequence, researchers may overlook rare variants that contribute to the missing heritability. The goal of the present article is threefold: (i) to assess this effect quantitatively using simulation studies for different types of random forests (classical random forests and conditional inference forests, which employ unbiased variable selection criteria) as well as for different importance measures (Gini and permutation based); (ii) to explore the trees and to compare the behaviour of random forests and the standard logistic regression model in order to understand the statistical mechanisms behind the preference for common variants; and (iii) to summarize these results and previously investigated properties of random forest VIMs in the context of genetic association studies and to make practical recommendations regarding the choice of the random forest and variable importance type. All our analyses can be reproduced using R code available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/ginibias/.
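The preference of a Gini-based criterion for common variants can be probed with a very small simulation. The sketch below is not the authors' simulation design and is not taken from the companion R code. It is a single-node illustration in plain Python under stated assumptions: genotypes are coded 0/1/2 and drawn under Hardy-Weinberg proportions, the binary phenotype is independent of the genotype (a null SNP), and the function names `best_gini_decrease` and `mean_null_decrease` are hypothetical. It measures the largest Gini impurity decrease a null SNP achieves at one split, averaged over replicates, for a given minor allele frequency (MAF).

```python
import random


def gini(pos, n):
    """Gini impurity of a binary node with `pos` cases among `n` samples."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def best_gini_decrease(genotypes, phenotypes):
    """Largest impurity decrease over the two ordered cut points of a 0/1/2 SNP."""
    n = len(genotypes)
    total_pos = sum(phenotypes)
    parent = gini(total_pos, n)
    best = 0.0
    for cut in (0, 1):  # left child holds genotypes <= cut
        left_n = sum(1 for g in genotypes if g <= cut)
        left_pos = sum(y for g, y in zip(genotypes, phenotypes) if g <= cut)
        right_n = n - left_n
        if left_n == 0 or right_n == 0:
            continue  # this cut point does not separate any samples
        right_pos = total_pos - left_pos
        children = (left_n * gini(left_pos, left_n)
                    + right_n * gini(right_pos, right_n)) / n
        best = max(best, parent - children)
    return best


def mean_null_decrease(maf, n_samples=200, n_reps=2000, seed=1):
    """Average best Gini decrease of a SNP that is unrelated to the phenotype."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        # Assumption: Hardy-Weinberg proportions, i.e. two independent allele draws.
        genotypes = [(rng.random() < maf) + (rng.random() < maf)
                     for _ in range(n_samples)]
        # Assumption: balanced phenotype, independent of the genotype.
        phenotypes = [rng.randrange(2) for _ in range(n_samples)]
        total += best_gini_decrease(genotypes, phenotypes)
    return total / n_reps


if __name__ == "__main__":
    for maf in (0.05, 0.25, 0.5):
        print(maf, mean_null_decrease(maf))
```

Under these assumptions the average decrease tends to be larger at MAF 0.5 than at MAF 0.05, even though neither SNP carries any signal. One plausible contributor in this toy setting is that a common SNP offers two well-populated cut points to choose the best from, whereas a rare SNP effectively offers one. A full forest adds further effects (tree depth, bootstrap sampling, candidate-variable sampling) that this sketch does not model.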
