TRM: A Powerful Two‐Stage Machine Learning Approach for Identifying SNP‐SNP Interactions

Studies have shown that interactions among single nucleotide polymorphisms (SNPs) may play an important role in understanding the causes of complex diseases. We propose an integrated approach that combines two machine‐learning methods, Random Forests (RF) and Multivariate Adaptive Regression Splines (MARS), to identify a subset of important SNPs and to detect interaction patterns more effectively and efficiently. In this two‐stage RF‐MARS (TRM) approach, RF is first applied to select a predictive subset of SNPs, and MARS is then used to identify the interaction patterns. We evaluated the performance of TRM under four models. RF variable selection was based on either the out‐of‐bag (OOB) classification error rate or the variable importance spectrum (IS). Our results indicate that RFOOB performed better than MARS and RFIS in detecting important variables. TRMOOB, which is RFOOB followed by MARS, combines the strengths of RF and MARS in identifying SNP‐SNP interactions in a scenario of 100 candidate SNPs. TRMOOB had a higher true positive rate and a lower false positive rate than MARS, particularly when searching for interactions strongly associated with the outcome. We therefore favor TRMOOB for exploring SNP‐SNP interactions in large‐scale genetic variation studies.
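To illustrate how the MARS stage can express an interaction pattern, the following minimal sketch builds a two-way term as the product of two hinge basis functions, which is the general form MARS uses for interactions. The additive genotype coding (0, 1, 2 minor-allele copies), the SNP names, and the knot values are assumptions chosen for illustration; they are not taken from the study.

```python
from itertools import product


def hinge(x, knot, direction=1):
    """MARS hinge basis function.

    direction=+1 gives max(0, x - knot); direction=-1 gives max(0, knot - x).
    """
    return max(0.0, float(direction * (x - knot)))


def interaction_term(snp_a, snp_b):
    # Hypothetical two-way interaction term: a product of two hinges.
    # The knots (0 for SNP_A, 1 for SNP_B) are illustrative only.
    return hinge(snp_a, knot=0) * hinge(snp_b, knot=1)


# Evaluate the term on all nine genotype combinations, assuming each SNP
# is coded as the number of minor alleles (0, 1, or 2).
table = {(a, b): interaction_term(a, b) for a, b in product(range(3), repeat=2)}

for (a, b), value in sorted(table.items()):
    print(f"SNP_A={a} SNP_B={b} term={value}")
```

Under these assumed knots, the term is nonzero only when SNP_A carries at least one minor allele and SNP_B carries two, so a single basis function isolates one joint genotype pattern. In a fitted MARS model, the knots and the pairing of variables would be chosen by the algorithm from the data rather than fixed by hand.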
