An approach using random forest methodology for disease risk prediction using imbalanced case–control data in GWAS

Abstract: Because single nucleotide polymorphisms (SNPs) are known to be associated with disease, predicting an individual's disease risk from SNP genotyping data using state-of-the-art prediction techniques is an important problem in genome-wide association studies (GWAS). In the present investigation, an approach based on random forest (RF) methodology is proposed for predicting disease risk from imbalanced case–control data. The proposed approach was compared with existing methods for imbalanced data, namely balanced random forest (BRF) and weighted random forest (WRF), on several performance metrics. The proposed approach was illustrated using a case–control data set for ulcerative colitis and was found to achieve higher prediction accuracy than the existing methods.
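For orientation, the balanced random forest (BRF) baseline named in the abstract is commonly described as growing each tree on a bootstrap sample that contains equal numbers of minority-class and majority-class observations. The sketch below illustrates only that sampling step, under stated assumptions: the function name `balanced_bootstrap` and the toy 20-case/180-control labels are hypothetical, and this is not the approach proposed in the paper, whose details the abstract does not give.

```python
import random
from collections import Counter


def balanced_bootstrap(labels, rng):
    """Return indices of one class-balanced bootstrap sample.

    From every class, draw (with replacement) as many observations
    as the smallest class contains.
    """
    # Group observation indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Size of the smallest (minority) class.
    n_minority = min(len(indices) for indices in by_class.values())

    # Sample the same number of indices from each class.
    sample = []
    for indices in by_class.values():
        sample.extend(rng.choices(indices, k=n_minority))
    return sample


# Hypothetical toy data: 1 = case, 0 = control (imbalanced 20 vs 180).
labels = [1] * 20 + [0] * 180
sample = balanced_bootstrap(labels, random.Random(0))
counts = Counter(labels[i] for i in sample)
print(counts)
```

In a full BRF, one such sample would be drawn per tree; a weighted random forest (WRF) would instead keep the ordinary bootstrap and assign a larger misclassification weight to the minority class.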
