Improved high-dimensional prediction with Random Forests by the use of co-data

Background: Prediction in high-dimensional settings is difficult because the number of variables is large relative to the sample size. We demonstrate how auxiliary ‘co-data’ can be used to improve the performance of a Random Forest in such a setting.

Results: Co-data are defined here as any type of information that is available on the variables of the primary data but does not use its response labels. Co-data are incorporated in the Random Forest by replacing the uniform sampling probabilities used to draw candidate variables with co-data moderated sampling probabilities. Inspired by empirical Bayes, these moderated sampling probabilities are learned from the data at hand. We demonstrate the co-data moderated Random Forest (CoRF) with two examples. In the first example we aim to predict the presence of a lymph node metastasis from gene expression data. We demonstrate how a set of external p-values, a gene signature, and the correlation between gene expression and DNA copy number can improve the predictive performance. In the second example we demonstrate how the prediction of cervical (pre-)cancer from methylation data can be improved by including the location of the probe relative to the known CpG islands, the number of CpG sites targeted by a probe, and a set of p-values from a related study.

Conclusion: The proposed method is able to utilize auxiliary co-data to improve the performance of a Random Forest.
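The core mechanism, drawing candidate variables with non-uniform probabilities instead of uniform ones, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the weight values, and the choice to draw candidates without replacement are all assumptions made for illustration, and the step that learns the probabilities from co-data is not shown.

```python
import random


def sample_candidates(weights, mtry, rng):
    """Draw up to `mtry` distinct variable indices.

    Each draw picks one of the remaining variables with probability
    proportional to its weight. Variables with weight 0 are never drawn.
    (Hypothetical helper, not part of any Random Forest library.)
    """
    pool = [j for j, w in enumerate(weights) if w > 0]
    pool_w = [weights[j] for j in pool]
    chosen = []
    while pool and len(chosen) < mtry:
        pos = rng.choices(range(len(pool)), weights=pool_w, k=1)[0]
        chosen.append(pool.pop(pos))
        pool_w.pop(pos)
    return chosen


def selection_counts(weights, rng, n_nodes=2000, mtry=2):
    """Count how often each variable is offered as a split candidate."""
    counts = [0] * len(weights)
    for _ in range(n_nodes):
        for j in sample_candidates(weights, mtry, rng):
            counts[j] += 1
    return counts


rng = random.Random(0)
p = 6

# Standard Random Forest: every variable is equally likely to be a candidate.
uniform = [1.0 / p] * p

# Assumed, made-up moderated probabilities: variables 0-1 are favoured,
# variables 4-5 are excluded entirely.
moderated = [0.40, 0.30, 0.20, 0.10, 0.0, 0.0]

uniform_counts = selection_counts(uniform, rng)
moderated_counts = selection_counts(moderated, rng)
```

Comparing `uniform_counts` with `moderated_counts` shows the intended effect: under the moderated probabilities, variables with larger weights are offered as split candidates more often, and variables with zero weight are never offered.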
