Gene selection and classification of microarray data using random forest

Background: Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future diagnostic use in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection.

Results: We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated data and nine microarray data sets, we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than those of alternative methods) while preserving predictive accuracy.

Conclusion: Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.
