Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests

BackgroundSingle-nucleotide polymorphisms (SNPs) selection and identification are the most important tasks in Genome-wide association data analysis. The problem is difficult because genome-wide association data is very high dimensional and a large portion of SNPs in the data is irrelevant to the disease. Advanced machine learning methods have been successfully used in Genome-wide association studies (GWAS) for identification of genetic variants that have relatively big effects in some common, complex diseases. Among them, the most successful one is Random Forests (RF). Despite of performing well in terms of prediction accuracy in some data sets with moderate size, RF still suffers from working in GWAS for selecting informative SNPs and building accurate prediction models. In this paper, we propose to use a new two-stage quality-based sampling method in random forests, named ts-RF, for SNP subspace selection for GWAS. The method first applies p-value assessment to find a cut-off point that separates informative and irrelevant SNPs in two groups. The informative SNPs group is further divided into two sub-groups: highly informative and weak informative SNPs. When sampling the SNP subspace for building trees for the forest, only those SNPs from the two sub-groups are taken into account. The feature subspaces always contain highly informative SNPs when used to split a node at a tree.ResultsThis approach enables one to generate more accurate trees with a lower prediction error, meanwhile possibly avoiding overfitting. It allows one to detect interactions of multiple SNPs with the diseases, and to reduce the dimensionality and the amount of Genome-wide association data needed for learning the RF model. Extensive experiments on two genome-wide SNP data sets (Parkinson case-control data comprised of 408,803 SNPs and Alzheimer case-control data comprised of 380,157 SNPs) and 10 gene data sets have demonstrated that the proposed model significantly reduced prediction errors and outperformed most existing the-state-of-the-art random forests. The top 25 SNPs in Parkinson data set were identified by the proposed model including four interesting genes associated with neurological disorders.ConclusionThe presented approach has shown to be effective in selecting informative sub-groups of SNPs potentially associated with diseases that traditional statistical approaches might fail. The new RF works well for the data where the number of case-control objects is much smaller than the number of SNPs, which is a typical problem in gene data and GWAS. Experiment results demonstrated the effectiveness of the proposed RF model that outperformed the state-of-the-art RFs, including Breiman's RF, GRRF and wsRF methods.

[1]  Yan V. Sun,et al.  Classification of rheumatoid arthritis status with candidate gene and genome-wide single-nucleotide polymorphisms using random forests , 2007, BMC proceedings.

[2]  P. Donnelly,et al.  Genome-wide strategies for detecting multiple loci that influence complex diseases , 2005, Nature Genetics.

[3]  Ricardo Cao,et al.  Evaluating the Ability of Tree‐Based Methods and Logistic Regression for the Detection of SNP‐SNP Interaction , 2009, Annals of human genetics.

[4]  K. Lunetta,et al.  Identifying SNPs predictive of phenotype using random forests , 2005, Genetic epidemiology.

[5]  George C. Runger,et al.  Gene selection with guided regularized random forest , 2012, Pattern Recognit..

[6]  Alberto Piazza,et al.  Genome-wide association of early-onset myocardial infarction with single nucleotide polymorphisms and copy number variants , 2009, Nature Genetics.

[7]  J. Ott,et al.  Selecting SNPs in two‐stage analysis of disease association data: a model‐free approach , 2000, Annals of human genetics.

[8]  K. Lunetta,et al.  Screening large-scale association study data: exploiting interactions using random forests , 2004, BMC Genetics.

[9]  D. Balding A tutorial on statistical methods for population association studies , 2006, Nature Reviews Genetics.

[10]  Yung-Seop Lee,et al.  Enriched random forests , 2008, Bioinform..

[11]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[12]  Yi Yu,et al.  Performance of random forest when SNPs are in linkage disequilibrium , 2009, BMC Bioinformatics.

[13]  I. König,et al.  Picking single-nucleotide polymorphisms in forests , 2007, BMC proceedings.

[14]  Nguyen Thanh Tung,et al.  Extensions to Quantile Regression Forests for Very High-Dimensional Data , 2014, PAKDD.

[15]  D. Stephan,et al.  Genetic control of human brain transcript expression in Alzheimer disease. , 2009, American journal of human genetics.

[16]  Sonja W. Scholz,et al.  Genome-wide genotyping in Parkinson's disease and neurologically normal controls: first stage analysis and public release of data , 2006, The Lancet Neurology.

[17]  Yunming Ye,et al.  Classifying Very High-Dimensional Data with Random Forests Built from Small Subspaces , 2012, Int. J. Data Warehous. Min..

[18]  Ramón Díaz-Uriarte,et al.  Gene selection and classification of microarray data using random forest , 2006, BMC Bioinformatics.

[19]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[20]  H. Cordell Detecting gene–gene interactions that underlie human diseases , 2009, Nature Reviews Genetics.

[21]  F. Wilcoxon Individual Comparisons by Ranking Methods , 1945 .

[22]  M. Daly,et al.  A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms , 2001, Nature.

[23]  Andreas Ziegler,et al.  On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data , 2010, Bioinform..

[24]  Houtao Deng,et al.  Guided Random Forest in the RRF Package , 2013, ArXiv.

[25]  Achim Zeileis,et al.  Bias in random forest variable importance measures: Illustrations, sources and a solution , 2007, BMC Bioinformatics.

[26]  Andy Liaw,et al.  Classification and Regression by randomForest , 2007 .

[27]  M. Ng,et al.  SNP Selection and Classification of Genome-Wide SNP Data Using Stratified Sampling Random Forests , 2012, IEEE Transactions on NanoBioscience.

[28]  H. Cordell Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. , 2002, Human molecular genetics.

[29]  Kellie J. Archer,et al.  Empirical characterization of random forest variable importance measures , 2008, Comput. Stat. Data Anal..

[30]  Andreas Ziegler,et al.  On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data , 2010, Bioinform..

[31]  Simon C. Potter,et al.  Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls , 2007, Nature.

[32]  Wei-Yin Loh,et al.  Classification and regression trees , 2011, WIREs Data Mining Knowl. Discov..