Block Forests: random forests for blocks of clinical and omics covariate data

Background: In recent years, more and more multi-omics data have become available, that is, data featuring measurements of several types of omics data for each patient. Using multi-omics data as covariate data in outcome prediction is both promising and challenging due to the complex structure of such data. Random forest is a prediction method known for its ability to capture complex dependency patterns between the outcome and the covariates. Against this background, we developed five candidate random forest variants tailored to multi-omics covariate data. These variants modify the split point selection of random forest to incorporate the block structure of multi-omics data and can be applied to any outcome type for which a random forest variant exists, such as categorical, continuous and survival outcomes. Using 20 publicly available multi-omics data sets with survival outcome, we compared the prediction performances of the block forest variants with alternatives. We also considered the common special case of having clinical covariates and measurements of a single omics data type available.

Results: We identified one variant, termed “block forest”, that outperformed all other approaches in the comparison study. In particular, it performed significantly better than standard random survival forest (adjusted p-value: 0.027). The two best performing variants have in common that the block choice is randomized in the split point selection procedure. In the case of having clinical covariates and a single omics data type available, the improvements of the variants over random survival forest were larger than in the case of the multi-omics data. The degree of improvement over random survival forest varied strongly across data sets. Moreover, making the consideration of all clinical covariates mandatory improved the performance. This result should, however, be interpreted with caution, because the level of predictive information contained in clinical covariates depends on the specific application.

Conclusions: The new prediction method block forest for multi-omics data can significantly improve the prediction performance of random forest and outperformed alternatives in the comparison. Block forest is particularly effective for the special case of using clinical covariates in combination with measurements of a single omics data type.
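
The abstract notes that the two best performing variants randomize the block choice during split point selection. The following is a minimal Python sketch of what such a randomized block choice could look like when drawing candidate covariates for a single split. It is an illustration only, not the authors' implementation: the function name `sample_candidate_covariates`, the block selection probability of 0.5, the redraw rule when no block is selected, and the per-block `mtry` values are all assumptions that do not come from the text above.

```python
import random


def sample_candidate_covariates(blocks, mtry, rng, block_prob=0.5):
    """Draw candidate covariates for one split with a randomized block choice.

    blocks     -- dict mapping block name to a list of covariate indices
    mtry       -- dict mapping block name to the number of covariates to draw
    rng        -- a random.Random instance (makes the draw reproducible)
    block_prob -- assumed probability of including each block (hypothetical)
    """
    # Include each block independently; redraw if no block was selected,
    # so that every split has at least one candidate covariate.
    while True:
        chosen = [name for name in blocks if rng.random() < block_prob]
        if chosen:
            break

    # Within each selected block, sample covariates without replacement.
    candidates = {}
    for name in chosen:
        k = min(mtry[name], len(blocks[name]))
        candidates[name] = rng.sample(blocks[name], k)
    return candidates


# Hypothetical layout: 3 clinical covariates plus two omics blocks.
rng = random.Random(0)
blocks = {
    "clinical": [0, 1, 2],
    "mrna": list(range(3, 103)),
    "mirna": list(range(103, 153)),
}
mtry = {"clinical": 3, "mrna": 10, "mirna": 7}
candidates = sample_candidate_covariates(blocks, mtry, rng)
```

In a full tree-growing procedure, the split criterion would then be evaluated only on the returned candidates, and a fresh draw would be made at every node. How the blocks are weighted against each other when the best split is chosen is not described in the abstract and is left out of this sketch.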
