Pathway analysis using random forests classification and regression

MOTIVATION: Although numerous methods have been developed to better capture biological information from microarray data, commonly used single gene-based methods neglect interactions among genes, leaving room for novel approaches. For example, most classification and regression methods for microarray data are based on the whole set of genes and do not make use of pathway information. Pathway-based analysis in microarray studies may lead to more informative and relevant knowledge for biological researchers.

RESULTS: In this paper, we describe pathway-based classification and regression methods that use Random Forests to analyze gene expression data. The proposed methods allow researchers to rank important pathways from externally available databases, discover important genes, find pathway-based outlying cases and make full use of a continuous outcome variable in the regression setting. We also compared Random Forests with other machine learning methods using several datasets and found that Random Forests classification error rates were either the lowest or the second-lowest. By combining pathway information and novel statistical methods, this procedure represents a promising computational strategy for dissecting pathways and can provide biological insight into the study of microarray data.

AVAILABILITY: Source code written in R is available from http://bioinformatics.med.yale.edu/pathway-analysis/rf.htm.
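
The idea of ranking pathways by how well their member genes classify the samples can be illustrated with a small sketch. This is not the authors' R implementation: to stay dependency-free it swaps in a bagged ensemble of one-split decision stumps with random gene subsets as a stand-in for full Random Forests, and it scores each pathway by out-of-bag (OOB) error. All function names (`fit_stump`, `oob_error`, `rank_pathways`), the pathway names, and the expression values are hypothetical and synthetic, chosen only for illustration.

```python
import random
from collections import Counter


def fit_stump(X, y, features):
    """Find the single split (feature, threshold, left label, right label)
    with the fewest training errors among the candidate features."""
    best = None
    for f in features:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(a != l_lab for a in left) + sum(b != r_lab for b in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:] if best else None


def oob_error(X, y, n_trees=200, seed=0):
    """OOB error of a bagged stump ensemble (simplified stand-in for a forest)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    mtry = max(1, int(p ** 0.5))  # genes tried per tree
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(idx)
        feats = rng.sample(range(p), mtry)
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx], feats)
        if stump is None:
            continue
        f, t, l_lab, r_lab = stump
        for i in range(n):
            if i not in in_bag:  # only out-of-bag samples vote
                votes[i][l_lab if X[i][f] <= t else r_lab] += 1
    scored = [(v.most_common(1)[0][0], y[i]) for i, v in enumerate(votes) if v]
    return sum(pred != truth for pred, truth in scored) / len(scored)


def rank_pathways(expr, labels, pathways, **kwargs):
    """Fit one ensemble per pathway and sort pathways by ascending OOB error.
    expr maps gene name -> list of expression values, one per sample."""
    results = []
    for name, genes in pathways.items():
        present = [g for g in genes if g in expr]
        if not present:
            continue  # pathway has no measured genes
        X = [[expr[g][i] for g in present] for i in range(len(labels))]
        results.append((name, oob_error(X, labels, **kwargs)))
    return sorted(results, key=lambda r: r[1])


# Synthetic example: one pathway whose genes shift with class, one pure noise.
data_rng = random.Random(1)
labels = [0] * 20 + [1] * 20
expr = {}
for g in range(5):
    expr[f"sig{g}"] = [data_rng.gauss(3.0 * lab, 1.0) for lab in labels]
    expr[f"noise{g}"] = [data_rng.gauss(0.0, 1.0) for _ in labels]
pathways = {
    "signal_pathway": [f"sig{g}" for g in range(5)],
    "noise_pathway": [f"noise{g}" for g in range(5)],
}
ranking = rank_pathways(expr, labels, pathways, n_trees=200, seed=0)
for name, err in ranking:
    print(f"{name}: OOB error = {err:.3f}")
```

In practice the pathway gene sets would come from an external database, and a full Random Forests implementation would additionally supply the variable importance and proximity measures that the abstract relies on for finding important genes and outlying cases.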
