Power of Data Mining Methods to Detect Genetic Associations and Interactions

Background: Genetic association studies have thus far focused on the analysis of individual main effects of SNP markers. Nonetheless, there is a clear need to model epistasis, or gene-gene interactions, to better understand the biological basis of existing associations. Tree-based methods have been widely studied as tools for building prediction models based on complex variable interactions. Understanding the power of such methods to discover genetic associations in the presence of complex interactions is of great importance. Here, we systematically evaluate the power of three leading algorithms: random forests (RF), Monte Carlo logic regression (MCLR), and multifactor dimensionality reduction (MDR).

Methods: We use the algorithm-specific variable importance measures (VIMs) as test statistics and employ permutation-based resampling to generate the null distribution and the associated p-values. The power of the three methods is assessed via simulation studies. Additionally, in a data analysis, we evaluate the associations between individual SNPs in pro-inflammatory and immunoregulatory genes and the risk of non-Hodgkin lymphoma.

Results: The power of RF is highest in all simulation models, that of MCLR is similar to that of RF in half of them, and that of MDR is consistently the lowest.

Conclusions: Our study indicates that RF VIMs have the most reliable power. However, in addition to the tuning parameters, the power of RF is notably influenced by the type of variable (continuous vs. categorical) and the chosen VIM.
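The permutation step described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `permutation_p_values` and `mean_difference_importance` are hypothetical, and the importance statistic used here (absolute difference in mean genotype between cases and controls) is only a simple stand-in for an algorithm-specific VIM from RF, MCLR, or MDR. Any function that returns one importance value per variable could be passed in its place.

```python
import random
from typing import Callable, List, Sequence

Importance = Callable[[Sequence[Sequence[float]], Sequence[int]], List[float]]


def mean_difference_importance(X: Sequence[Sequence[float]],
                               y: Sequence[int]) -> List[float]:
    """Stand-in importance (an assumption, not an RF/MCLR/MDR VIM):
    absolute difference in mean genotype between cases (1) and controls (0)."""
    cases = [row for row, label in zip(X, y) if label == 1]
    controls = [row for row, label in zip(X, y) if label == 0]
    n_vars = len(X[0])
    return [
        abs(sum(r[j] for r in cases) / len(cases)
            - sum(r[j] for r in controls) / len(controls))
        for j in range(n_vars)
    ]


def permutation_p_values(X: Sequence[Sequence[float]],
                         y: Sequence[int],
                         importance_fn: Importance,
                         n_perm: int = 200,
                         seed: int = 0) -> List[float]:
    """Permutation p-value per variable, using the importance as statistic.

    The outcome labels are shuffled to break any association with the
    predictors, the importances are recomputed on each shuffled dataset,
    and the p-value is the share of null importances at least as large
    as the observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = importance_fn(X, y)
    exceed = [0] * len(observed)
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        null = importance_fn(X, y_perm)
        for j, (null_j, obs_j) in enumerate(zip(null, observed)):
            if null_j >= obs_j:
                exceed[j] += 1
    return [(count + 1) / (n_perm + 1) for count in exceed]


# Toy data (invented for illustration): 40 subjects, alternating
# control/case labels. SNP 0 tracks the outcome; SNP 1 is balanced
# across cases and controls by construction.
y_toy = [i % 2 for i in range(40)]
X_toy = [[2 * (i % 2), (i // 2) % 3] for i in range(40)]
p_toy = permutation_p_values(X_toy, y_toy, mean_difference_importance)
print(p_toy)
```

In a power study of the kind the abstract describes, a procedure like this would be repeated over many simulated datasets, and power estimated as the fraction of datasets in which the causal variable's p-value falls below the chosen significance level.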
