Scalable Ensemble Learning and Computationally Efficient Variance Estimation

Ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. The Super Learner algorithm is an ensemble method that has been theoretically shown to be asymptotically optimal for learning. The Super Learner, also known as stacking, combines multiple, typically diverse, base learning algorithms into a single, powerful prediction function through a secondary learning process called metalearning. Although ensemble methods offer superior performance over their single-algorithm counterparts, ensembles carry an inherent computational cost, since they require training multiple base learning algorithms. We present several practical solutions for reducing the computational burden of ensemble learning while retaining superior model performance, along with software, code examples, and benchmarks. Further, we present a generalized metalearning method for approximating the combination of base learners that maximizes a model performance metric of interest. As an example, we construct an AUC-maximizing Super Learner and show that this technique works especially well in the case of imbalanced binary outcomes. We conclude by presenting a computationally efficient approach to approximating the variance of cross-validated AUC estimates using influence functions. This technique can be used generally to obtain confidence intervals for any estimator; however, because AUC is used extensively in biostatistics, cross-validated AUC serves as a practical, motivating example. The goal of this body of work is to provide new scalable approaches to obtaining the highest performing predictive models while optimizing any model performance metric of interest, and further, to provide computationally efficient inference for that estimate.
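As a concrete illustration of the stacking procedure described above, the sketch below fits a small Super Learner ensemble with the SuperLearner R package, which implements this algorithm. The simulated data and the base-learner library (penalized regression, random forest, and a mean-only benchmark) are illustrative assumptions, not a prescription.

```r
# Minimal Super Learner sketch using the SuperLearner R package.
# The simulated data and base-learner library are illustrative choices.
library(SuperLearner)  # this library also requires glmnet and randomForest

set.seed(1)
n <- 1000
x <- data.frame(matrix(rnorm(n * 10), nrow = n))
y <- rbinom(n, 1, plogis(x[[1]] - 0.5 * x[[2]]))

# V-fold cross-validation produces the "level-one" data on which the
# metalearner (here, non-negative least squares) learns how to combine
# the base learners.
fit <- SuperLearner(Y = y, X = x, family = binomial(),
                    SL.library = c("SL.glmnet", "SL.randomForest", "SL.mean"),
                    method = "method.NNLS",
                    cvControl = list(V = 5))
fit  # prints cross-validated risk and metalearner weight for each base learner
```

The printed weights show how much each base learner contributes to the final prediction function; learners that add nothing beyond the rest of the library typically receive weight near zero.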

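The other two contributions, AUC-maximizing metalearning and influence-function-based variance estimation for cross-validated AUC, are also available in software. The sketch below is a minimal illustration under stated assumptions: it swaps the metalearner for method.AUC (shipped with recent versions of the SuperLearner package, which numerically searches for the convex combination of base learners that maximizes cross-validated AUC), and then uses ci.cvAUC from the cvAUC R package to compute an influence-curve-based confidence interval for cross-validated AUC. The fold construction and the single logistic-regression learner in the second part are hypothetical choices made only to keep the example self-contained.

```r
library(SuperLearner)
library(cvAUC)

set.seed(1)
n <- 1000
x <- data.frame(matrix(rnorm(n * 10), nrow = n))
y <- rbinom(n, 1, plogis(x[[1]] - 0.5 * x[[2]]))

# AUC-maximizing metalearner: same ensemble idea as above, but the
# base-learner weights are chosen to maximize cross-validated AUC
# rather than to minimize squared error.
fit_auc <- SuperLearner(Y = y, X = x, family = binomial(),
                        SL.library = c("SL.glmnet", "SL.randomForest", "SL.mean"),
                        method = "method.AUC")

# Influence-function-based 95% confidence interval for cross-validated AUC.
# We generate V-fold held-out predictions from a plain logistic regression
# so the inference step does not depend on the ensemble fit above.
V <- 10
folds <- sample(rep(1:V, length.out = n))
preds <- rep(NA_real_, n)
for (v in 1:V) {
  train <- folds != v
  m <- glm(y[train] ~ ., data = x[train, ], family = binomial())
  preds[!train] <- predict(m, newdata = x[!train, ], type = "response")
}
ci.cvAUC(predictions = preds, labels = y, folds = folds, confidence = 0.95)
```

Because the variance is estimated from the influence function of the cross-validated AUC estimator rather than by resampling, the confidence interval costs little beyond the cross-validation already performed, which is the computational point made in the abstract.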