Tree-Weighting for Multi-Study Ensemble Learners

Multi-study learning trains classifiers separately on each of several training studies and then combines them into an ensemble, with weights that reward members with better cross-study prediction performance. This article considers novel weighting approaches for constructing tree-based ensemble learners in this setting. Using Random Forests as the single-study learner, we compare two strategies: weighting each forest as a whole to form the ensemble, or extracting the individual trees trained within each Random Forest and weighting them directly. We consider weighting approaches that reward cross-study replicability within the training set. We find that incorporating multiple layers of ensembling in the training process increases the robustness of the resulting predictor. Furthermore, we explore how the ensembling weights relate to the internal structure of the trees, shedding light on the features that drive the relationship between the Random Forests algorithm and the true outcome model. Finally, we apply our approach to genomic datasets and show that our method improves upon the basic multi-study learning paradigm.
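
To make the two weighting strategies concrete, here is a minimal sketch, assuming scikit-learn's RandomForestClassifier as the single-study learner and mean leave-one-study-out accuracy as the replicability reward; the reward function, the `studies` data structure, and all function names are illustrative assumptions, not the authors' exact method.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_forests(studies, n_trees=100, seed=0):
    # One Random Forest per training study; `studies` is a list of (X, y) pairs.
    return [
        RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(X, y)
        for X, y in studies
    ]

def forest_weights(forests, studies):
    # Strategy 1: one weight per forest, set to its mean accuracy on the
    # *other* training studies, so cross-study replicability is rewarded.
    w = np.array([
        np.mean([f.score(X, y) for j, (X, y) in enumerate(studies) if j != k])
        for k, f in enumerate(forests)
    ])
    return w / w.sum()  # normalize to a convex combination

def tree_weights(forests, studies):
    # Strategy 2: flatten the forests into their individual trees and give
    # each tree its own cross-study weight. Assumes integer labels coded
    # 0..C-1 so the raw trees' encoded predictions align with y.
    trees, w = [], []
    for k, f in enumerate(forests):
        held_out = [(X, y) for j, (X, y) in enumerate(studies) if j != k]
        for t in f.estimators_:
            trees.append(t)
            w.append(np.mean([t.score(X, y) for X, y in held_out]))
    w = np.asarray(w)
    return trees, w / w.sum()

def weighted_predict_proba(members, weights, X):
    # Ensemble prediction: weighted average of member class probabilities
    # (assumes every member was trained on the same label set).
    probs = np.stack([m.predict_proba(X) for m in members])  # (K, n, n_classes)
    return np.tensordot(weights, probs, axes=1)
```

With either strategy the ensemble prediction is a convex combination of member outputs; the tree-level variant simply treats every tree across all forests as a separate ensemble member, which adds the extra layer of ensembling described above.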
