Variable selection from random forests: application to gene expression data

Abstract

Random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its use for gene selection.

We first show the effects of changes in parameters of random forest on the prediction error. Then we present an approach for gene selection that uses measures of variable importance and error rate, and is targeted towards the selection of small sets of genes. Using simulated and real microarray data, we show that the gene selection procedure yields small sets of genes while preserving predictive accuracy.

We first show the effects of changes in parameters of random forest on the prediction error rate with microarray data. Then we present two approaches for gene selection with random forest: 1) comparing variable importance plots from original and permuted data sets; 2) using backwards variable elimination. Using simulated and real microarray data, we show: 1) variable importance plots can be used to recover the full set of genes related to the outcome of interest, without being adversely affected by collinearities; 2) backwards variable elimination yields small sets of genes while preserving predictive accuracy (compared to several state-of-the-art algorithms). Thus, both methods are useful for gene selection.

All code is available as an R package, varSelRF, from CRAN (http://cran.r-project.org/src/contrib/PACKAGES.html) or from the supplementary material page.

Supplementary information: http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html
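The backwards variable elimination mentioned in the abstract can be illustrated with a minimal sketch. It is not the varSelRF implementation: the function names, the `fit` callback, the 20% drop fraction, the error tolerance, and the choice to re-rank variables at every step are all assumptions made for illustration. In practice `fit` would train a random forest on the given variables and return its error estimate and variable importances; here a toy stand-in is used so the example runs without any dependencies.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical callback type: given variable names, return
# (error rate, importance score per variable).
FitFn = Callable[[Sequence[str]], Tuple[float, Dict[str, float]]]


def backward_eliminate(
    variables: Sequence[str],
    fit: FitFn,
    drop_fraction: float = 0.2,  # assumed value, not taken from the source
    tolerance: float = 0.0,      # assumed slack allowed above the best error
    min_vars: int = 2,
) -> Tuple[List[str], List[Tuple[int, float]]]:
    """Iteratively drop the least important variables, then return the
    smallest variable set whose error is within `tolerance` of the best."""
    current = list(variables)
    history: List[Tuple[List[str], float]] = []
    while True:
        error, importance = fit(current)
        history.append((list(current), error))
        if len(current) <= min_vars:
            break
        # Rank from most to least important and cut the tail.
        ranked = sorted(current, key=lambda v: importance[v], reverse=True)
        n_drop = max(1, int(len(ranked) * drop_fraction))
        n_keep = max(min_vars, len(ranked) - n_drop)
        current = ranked[:n_keep]
    best_error = min(err for _, err in history)
    candidates = [vs for vs, err in history if err <= best_error + tolerance]
    smallest = min(candidates, key=len)
    return smallest, [(len(vs), err) for vs, err in history]


def toy_fit(vars_in: Sequence[str]) -> Tuple[float, Dict[str, float]]:
    """Toy stand-in for a random forest fit: only g1 and g2 carry signal."""
    signal = {"g1": 1.0, "g2": 0.8}
    importance = {v: signal.get(v, 0.01) for v in vars_in}
    error = 0.05 if all(g in vars_in for g in signal) else 0.40
    return error, importance


genes = [f"g{i}" for i in range(1, 11)]
selected, trace = backward_eliminate(genes, toy_fit)
print(selected)  # smallest set reaching the best error
print(trace)     # (number of variables, error) at each step
```

With the toy scorer, the error stays flat as noise variables are removed, so the smallest set in the trajectory is selected. With real data the error estimate would come from the forest itself (for example an out-of-bag estimate), and the tolerance would control how much accuracy is traded for a smaller gene set.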
