The Ensemble Bridge Algorithm: A New Modeling Tool for Drug Discovery Problems

Ensemble algorithms have historically been divided into two paradigms, boosting and random forests, which differ significantly in how the ensemble is constructed. Boosting algorithms represent one extreme: they combine an iterative greedy optimization strategy, weak learners (e.g., small classification trees), and stage weights to target difficult-to-classify regions of the training space. At the other extreme, random forests rely on randomly selected features and complex learners (learners that exhibit low bias, e.g., large regression trees) to classify well over the entire training data. Because random forests do not greedily target the next learner for inclusion, they tend to be naturally robust to noisy labels. In this work, we introduce the ensemble bridge algorithm, which can transition between boosting and random forests by means of a regularization parameter nu in [0,1]. Because the ensemble bridge algorithm is a compromise between the greedy nature of boosting and the randomness of random forests, it yields robust performance when the response is noisy and superior performance when the response is clean. Drug discovery data (e.g., computational chemistry data) often have varying levels of noise. Hence, a practitioner can rely on this single method to evaluate ensemble performance, rather than choosing between boosting and random forests according to the noise level of each data set. The method's robustness is verified across a variety of data sets, on which the algorithm repeatedly yields better performance than either boosting or random forests alone. Finally, we provide diagnostic tools for the new algorithm, including a measure of variable importance and an observational clustering tool.