Systems biology

Pathway analysis using random forests classification and regression

Motivation: Although numerous methods have been developed to extract biological information from microarray data, commonly used single-gene methods neglect interactions among genes, which leaves room for novel approaches. For example, most classification and regression methods for microarray data operate on the whole set of genes and do not use pathway information. Pathway-based analysis of microarray studies may yield more informative and relevant knowledge for biological researchers.

Results: In this paper, we describe a pathway-based classification and regression method that uses Random Forests to analyze gene expression data. The proposed method allows researchers to rank important pathways from externally available databases, discover important genes, find pathway-based outlying cases and make full use of a continuous outcome variable in the regression setting. We also compared Random Forests with other machine learning methods on several datasets and found that Random Forests classification error rates were either the lowest or the second-lowest. By combining pathway information with novel statistical methods, this procedure represents a promising computational strategy for dissecting pathways and can provide biological insight into microarray data.

Availability: Source code written in R is available from http://bioinformatics.med.yale.edu/pathway-analysis/rf.htm

Contact: hongyu.zhao@yale.edu

Supplementary Information: Supplementary Data are available at http://bioinformatics.med.yale.edu/pathway-analysis/rf.htm
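The pathway-ranking idea summarized above can be illustrated with a short sketch. This is not the authors' implementation, which is written in R and linked under Availability. It is a minimal Python example that assumes NumPy and scikit-learn are installed, and it assumes pathways are ranked by the out-of-bag (OOB) error of a forest fitted on each pathway's genes alone. The function name `rank_pathways`, the pathway names and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rank_pathways(expr, labels, pathways, n_trees=500, seed=0):
    """Fit one forest per pathway and rank pathways by out-of-bag error.

    expr: (n_samples, n_genes) expression matrix.
    labels: (n_samples,) class labels.
    pathways: dict mapping pathway name -> list of gene column indices.
    Returns (name, oob_error) pairs sorted from lowest to highest error.
    """
    results = []
    for name, gene_idx in pathways.items():
        forest = RandomForestClassifier(
            n_estimators=n_trees, oob_score=True, random_state=seed
        )
        # Only the genes belonging to this pathway are used as features.
        forest.fit(expr[:, gene_idx], labels)
        oob_error = 1.0 - forest.oob_score_  # oob_score_ is OOB accuracy
        results.append((name, oob_error))
    return sorted(results, key=lambda item: item[1])


# Synthetic illustration (hypothetical data, not taken from the paper).
rng = np.random.default_rng(0)
n_samples, n_genes = 60, 30
labels = np.repeat([0, 1], n_samples // 2)
expr = rng.normal(size=(n_samples, n_genes))
expr[labels == 1, :5] += 2.0  # genes 0-4 carry the class signal

pathways = {
    "pathway_signal": [0, 1, 2, 3, 4],
    "pathway_noise_a": [10, 11, 12, 13, 14],
    "pathway_noise_b": [20, 21, 22, 23, 24, 25],
}
ranking = rank_pathways(expr, labels, pathways, n_trees=200)
for name, err in ranking:
    print(f"{name}: OOB error = {err:.3f}")
```

In this sketch a pathway whose genes separate the classes gets a low OOB error and appears first in the ranking. For a continuous outcome, the same loop could use `RandomForestRegressor` and rank pathways by an OOB regression measure instead.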
