Convergent Random Forest predictor: methodology for predicting drug response from genome-scale data applied to anti-TNF response.

Biomarker development for prediction of patient response to therapy is one of the goals of molecular profiling of human tissues. Because of the large number of transcripts, the relatively limited number of samples, and the high variability of the data, identifying predictive biomarkers is a challenge for data analysis. Furthermore, many genes may be responsible for differences in drug response, but often only a few are sufficient for accurate prediction. Here we present an analysis approach, the Convergent Random Forest (CRF) method, for the identification of highly predictive biomarkers. The aim is to select from genome-wide expression data a small number of non-redundant biomarkers that could be developed into a simple and robust diagnostic tool. Our method combines the Random Forest classifier and gene expression clustering to rank and select a small number of predictive genes. We evaluated the CRF approach by analyzing four different data sets. The first set contains transcript profiles of whole blood from rheumatoid arthritis patients, collected before anti-TNF treatment, together with the patients' subsequent response to the therapy. In this set, CRF identified 8 transcripts that predict response to therapy with 89% accuracy. We also applied the CRF to three previously published expression data sets. For all sets, we compared the CRF and recursive support vector machines (RSVM) approaches to feature selection and classification. In all cases the CRF selects a much smaller number of features, five to eight genes, while achieving similar or better performance on both the training sets and the independent test sets. For both methods, performance estimates obtained by cross-validation are similar to the performance on independent samples. The method has been implemented in R and is available from the authors upon request: Jadwiga.Bienkowska@biogenidec.com.
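
The abstract does not spell out how CRF couples Random Forest ranking with expression clustering, so the following is only a minimal sketch of the general idea of choosing a few non-redundant, highly ranked genes. It is not the authors' algorithm. It assumes that per-gene importance scores are already available (for example, from a Random Forest), and it replaces clustering with a simpler greedy correlation filter. The function names, the toy expression values, the importance scores, and the 0.8 correlation threshold are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_nonredundant(expression, importance, max_genes=8, max_abs_corr=0.8):
    """Greedily keep the highest-ranked genes that are not strongly
    correlated with any gene already kept.

    expression: dict gene -> list of expression values (one per sample)
    importance: dict gene -> importance score (assumed precomputed)
    """
    ranked = sorted(importance, key=importance.get, reverse=True)
    selected = []
    for gene in ranked:
        redundant = any(
            abs(pearson(expression[gene], expression[kept])) >= max_abs_corr
            for kept in selected
        )
        if not redundant:
            selected.append(gene)
        if len(selected) == max_genes:
            break
    return selected


# Toy data (hypothetical): geneB tracks geneA, geneD tracks geneC.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 4.0, 6.2, 8.1, 9.9],
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
    "geneD": [4.9, 1.2, 4.1, 2.0, 2.9],
}
importance = {"geneA": 0.40, "geneB": 0.35, "geneC": 0.20, "geneD": 0.05}

selected = select_nonredundant(expression, importance)
print(selected)
```

On this toy input the filter keeps one gene from each correlated pair, which mirrors the stated goal of a small, non-redundant panel. A clustering-based variant would instead group co-expressed genes first and then pick a representative from each group.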
