A Scalable Supervised Subsemble Prediction Algorithm

Subsemble is a flexible ensemble method that partitions a full data set into subsets of observations, fits the same algorithm on each subset, and uses a tailored form of V-fold cross-validation to construct a prediction function that combines the subset-specific fits with a second metalearner algorithm. Previous work studied Subsemble with randomly created subsets and showed that such Subsembles often achieve better prediction performance than the underlying algorithm fit just once on the full data set. Since the final Subsemble estimator varies with the data used to create the subset-specific fits, different strategies for creating the subsets result in different Subsembles. We propose supervised partitioning of the covariate space to create the subsets used in Subsemble, together with a form of histogram regression as the metalearner that combines the subset-specific fits. We discuss applications to large-scale data sets and develop a practical Supervised Subsemble method that uses regression trees both to create the covariate space partitioning and to select the number of subsets. Through simulations and real data analysis, we show that this subset creation method can yield better prediction performance than the random subset version.
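The random-subset Subsemble procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a regression tree as the subset-specific base learner and ordinary linear regression as the metalearner (the paper proposes a form of histogram regression), and the function names `subsemble_fit` and `subsemble_predict` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

def subsemble_fit(X, y, n_subsets=3, n_folds=5, seed=0):
    """Sketch of a random-subset Subsemble with a linear metalearner."""
    rng = np.random.default_rng(seed)
    n = len(y)
    subset = rng.integers(0, n_subsets, n)  # random subset assignment
    fold = rng.integers(0, n_folds, n)      # V-fold assignment within subsets
    # Cross-validated predictions: Z[i, j] holds the prediction for
    # observation i from the subset-j fit trained without i's fold.
    Z = np.empty((n, n_subsets))
    for v in range(n_folds):
        for j in range(n_subsets):
            train = (subset == j) & (fold != v)
            m = DecisionTreeRegressor(random_state=0).fit(X[train], y[train])
            Z[fold == v, j] = m.predict(X[fold == v])
    # Metalearner combines the J subset-specific prediction columns.
    # (Linear regression stands in for the paper's histogram regression.)
    meta = LinearRegression().fit(Z, y)
    # Final subset-specific fits each use their full subset.
    fits = [DecisionTreeRegressor(random_state=0).fit(X[subset == j],
                                                      y[subset == j])
            for j in range(n_subsets)]
    return fits, meta

def subsemble_predict(fits, meta, X):
    """Combine the subset-specific predictions via the metalearner."""
    Z = np.column_stack([f.predict(X) for f in fits])
    return meta.predict(Z)
```

The supervised variant proposed in the paper would replace the random `subset` assignment with a partition of the covariate space learned by a regression tree.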
