Cross-Cluster Weighted Forests

Adapting machine learning algorithms to better handle natural clustering or batch effects within training datasets is imperative across a wide variety of biological applications. This article considers the effect of ensembling Random Forest learners trained on clusters within a single dataset whose features are heterogeneously distributed. We find that constructing ensembles of forests trained on clusters determined by algorithms such as k-means yields significant improvements in accuracy and generalizability over the traditional Random Forest algorithm. We call our approach the Cross-Cluster Weighted Forest and examine its robustness to various data-generating scenarios and outcome models. Furthermore, we explore how the data-partitioning and ensemble-weighting strategies drive the benefits of our method over the existing paradigm. Finally, we apply our approach to cancer molecular profiling and gene expression datasets that are naturally divisible into clusters and illustrate that it outperforms the classic Random Forest. Code and supplementary material are available at https://github.com/m-ramchandran/cross-

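To make the approach concrete, the following is a minimal sketch of the core idea in Python with scikit-learn. It is illustrative rather than a reproduction of the released implementation: the class name CrossClusterWeightedForest is ours, and weighting each forest by its normalized inverse out-of-cluster mean squared error is one plausible stand-in for the ensemble-weighting strategies the article actually studies.

```python
# Illustrative sketch of a cross-cluster weighted forest (regression case).
# Assumes numpy arrays X (n x p) and y (n,); requires scikit-learn.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor


class CrossClusterWeightedForest:
    def __init__(self, n_clusters=5, n_estimators=100, random_state=0):
        self.n_clusters = n_clusters
        self.n_estimators = n_estimators
        self.random_state = random_state

    def fit(self, X, y):
        # 1) Partition the training set into clusters on the features.
        km = KMeans(n_clusters=self.n_clusters, random_state=self.random_state)
        labels = km.fit_predict(X)
        self.forests_, errors = [], []
        for c in range(self.n_clusters):
            # 2) Train one Random Forest per cluster.
            rf = RandomForestRegressor(n_estimators=self.n_estimators,
                                       random_state=self.random_state)
            rf.fit(X[labels == c], y[labels == c])
            self.forests_.append(rf)
            # 3) Evaluate each forest on the clusters it did NOT see, so
            #    the weights reward cross-cluster generalization.
            held_out = labels != c
            errors.append(np.mean((rf.predict(X[held_out]) - y[held_out]) ** 2))
        # 4) Weight forests by normalized inverse out-of-cluster error
        #    (epsilon guards against division by zero on a perfect fit).
        inv = 1.0 / (np.asarray(errors) + 1e-12)
        self.weights_ = inv / inv.sum()
        return self

    def predict(self, X):
        # Final prediction: weighted average across cluster-specific forests.
        preds = np.column_stack([rf.predict(X) for rf in self.forests_])
        return preds @ self.weights_
```

Weighting by performance on clusters a forest never saw, rather than by in-cluster fit, is what distinguishes this scheme from naive cluster-wise ensembling: forests that transfer well across the feature heterogeneity receive larger weights, matching the article's emphasis on generalizability.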