SNP Selection and Classification of Genome-Wide SNP Data Using Stratified Sampling Random Forests

In high-dimensional genome-wide association (GWA) case-control data for complex diseases, a large proportion of single-nucleotide polymorphisms (SNPs) are usually irrelevant to the disease. Simple random sampling in a random forest, with the default mtry parameter used to choose the feature subspace, will select too many subspaces that contain no informative SNPs. An exhaustive search for an optimal mtry is often required to include useful and relevant SNPs and to exclude the vast number of non-informative ones. However, such a search is too time-consuming to be practical for high-dimensional GWA data. The main aim of this paper is to propose a stratified sampling method for feature subspace selection when generating the decision trees of a random forest for high-dimensional GWA data. Our idea is to design an equal-width discretization scheme on informativeness that divides the SNPs into multiple groups. In feature subspace selection, we randomly select the same number of SNPs from each group and combine them to form the subspace used to generate a decision tree. The advantage of this stratified sampling procedure is that it ensures each subspace contains enough useful SNPs while avoiding the very high computational cost of an exhaustive search for an optimal mtry and maintaining the randomness of the random forest. We employ two genome-wide SNP data sets (Parkinson case-control data comprising 408 803 SNPs and Alzheimer case-control data comprising 380 157 SNPs) to demonstrate that the proposed stratified sampling method is effective and that it generates better random forests, with higher accuracy and a lower error bound, than Breiman's random forest generation method. For the Parkinson data, we also report some interesting genes identified by the method, which may be associated with neurological disorders and merit further biological investigation.
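The subspace selection step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `equal_width_groups` and `stratified_subspace`, the use of NumPy, and the handling of empty or small groups are all assumptions made for illustration. The abstract does not say which informativeness measure is used, so the sketch takes precomputed per-SNP scores as input (a chi-square statistic against the case-control label would be one possible choice).

```python
import numpy as np

def equal_width_groups(scores, n_groups):
    """Assign each SNP to one of n_groups equal-width bins of its score.

    `scores` is a hypothetical precomputed informativeness score per SNP.
    Returns an integer group label in 0..n_groups-1 for every SNP.
    """
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        # All scores identical: everything falls into a single group.
        return np.zeros(scores.shape[0], dtype=int)
    edges = np.linspace(lo, hi, n_groups + 1)
    # Use only the interior edges so the maximum score lands in the last bin.
    return np.digitize(scores, edges[1:-1])

def stratified_subspace(groups, per_group, rng):
    """Draw up to `per_group` SNP indices at random from each non-empty group."""
    chosen = []
    for g in np.unique(groups):
        members = np.flatnonzero(groups == g)
        # Assumption: a group smaller than per_group contributes all its members.
        k = min(per_group, members.size)
        chosen.append(rng.choice(members, size=k, replace=False))
    return np.concatenate(chosen)

# Toy example: 7 SNPs, 3 groups, 2 SNPs drawn per group.
rng = np.random.default_rng(0)
scores = [0.0, 0.1, 0.2, 0.5, 0.6, 0.9, 1.0]
groups = equal_width_groups(scores, n_groups=3)
subspace = stratified_subspace(groups, per_group=2, rng=rng)
```

In a full forest, one such subspace would be drawn for each decision tree, so that every tree sees SNPs from the highly informative groups as well as the weakly informative ones. With highly skewed scores some equal-width bins may be empty; this sketch simply skips them.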
