Pathway analysis using random forests with bivariate node-split for survival outcomes

MOTIVATION There is great interest in pathway-based methods for genomics data analysis in the research community. Although machine learning methods, such as random forests, have been developed to correlate survival outcomes with a set of genes, no study has assessed the ability of these methods to incorporate pathway information when analyzing microarray data. In general, genes that are identified without incorporating biological knowledge are more difficult to interpret. Correlating pathway-based gene expression with survival outcomes may lead to biologically more meaningful prognostic biomarkers. Thus, a comprehensive study of how these methods perform in a pathway-based setting is warranted.

RESULTS In this article, we describe a pathway-based method that uses random forests to correlate gene expression data with survival outcomes, and we introduce a novel bivariate node-splitting random survival forest. The proposed method allows researchers to identify pathways that are important for predicting patient prognosis and time to disease progression, and to discover important genes within those pathways. We compared different implementations of random forests with different split criteria and found that the bivariate node-splitting random survival forest with the log-rank test is among the best. We also performed simulation studies, which showed that random forests outperform several other machine learning algorithms and achieve results comparable to those of a newly developed component-wise Cox boosting model. Thus, pathway-based survival analysis using machine learning tools represents a promising approach for dissecting pathways and for generating new biological hypotheses from microarray studies.

AVAILABILITY The R package Pwayrfsurvival is available at http://www.duke.edu/~hp44/pwayrfsurvival.htm.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
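To make the log-rank split criterion named in the abstract more concrete, the following is a minimal sketch in plain Python, not the authors' implementation and not the Pwayrfsurvival API. It computes the standard two-sample log-rank chi-square statistic for one candidate split of a node. The abstract does not define how the bivariate split is formed, so the two-gene rule used here (a sample goes left when both genes fall at or below their thresholds) is an assumption chosen only for illustration, and the function names `logrank_statistic` and `best_two_gene_split` are hypothetical.

```python
def logrank_statistic(times, events, in_left):
    """Two-sample log-rank chi-square statistic for a candidate node split.

    times   : follow-up time per sample
    events  : 1 if the event was observed, 0 if censored
    in_left : True if the sample falls in the left daughter node
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        # Samples still at risk at time t, overall and in the left node.
        n = sum(1 for x in times if x >= t)
        n_left = sum(1 for x, g in zip(times, in_left) if g and x >= t)
        # Events occurring exactly at time t, overall and in the left node.
        d = sum(1 for x, e in zip(times, events) if e and x == t)
        d_left = sum(1 for x, e, g in zip(times, events, in_left)
                     if g and e and x == t)
        obs_minus_exp += d_left - d * n_left / n
        if n > 1:
            p = n_left / n
            variance += d * p * (1 - p) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0
    return obs_minus_exp ** 2 / variance


def best_two_gene_split(gene_a, gene_b, times, events):
    """Exhaustive search over threshold pairs for two genes.

    ASSUMPTION: the rule 'left if a <= ca and b <= cb' is a hypothetical
    stand-in for a bivariate split; the source does not specify its form.
    Returns (best statistic, threshold for gene_a, threshold for gene_b).
    """
    best = (0.0, None, None)
    for ca in sorted(set(gene_a)):
        for cb in sorted(set(gene_b)):
            in_left = [a <= ca and b <= cb for a, b in zip(gene_a, gene_b)]
            # Skip degenerate splits that leave one daughter node empty.
            if all(in_left) or not any(in_left):
                continue
            stat = logrank_statistic(times, events, in_left)
            if stat > best[0]:
                best = (stat, ca, cb)
    return best
```

In a forest, a search like this would be repeated at each node over a random subset of the genes in a pathway, and the split with the largest statistic would be kept. The exhaustive double loop is quadratic in the number of distinct expression values, so a practical implementation would likely restrict the candidate thresholds.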
