Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning

Empirical risk minimization (ERM) is the workhorse of machine learning, whether for classification and regression or for off-policy policy learning, but its model-agnostic guarantees can fail when we use adaptively collected data, such as the result of running a contextual bandit algorithm. We study a generic importance-sampling-weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class and provide first-of-their-kind generalization guarantees and fast convergence rates. Our results are based on a new maximal inequality that carefully leverages the importance-sampling structure to obtain rates with the right dependence on the exploration rate in the data. For regression, we provide fast rates that leverage the strong convexity of squared-error loss. For policy learning, we provide regret guarantees that close an open gap in the existing literature whenever exploration decays to zero, as is the case for bandit-collected data. An empirical investigation validates our theory.
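As a rough illustration of the weighted-ERM idea the abstract describes, the sketch below estimates a candidate policy's value from logged bandit data by inverse-propensity (importance-sampling) weighting and then selects the best policy from a finite class. This is a minimal sketch under simplifying assumptions, not the paper's implementation: the function names are hypothetical, the policy class is assumed to be a small finite list that can be enumerated, and the per-round logging propensities are assumed to have been recorded at collection time (as is typical for data gathered by a contextual bandit algorithm).

```python
import numpy as np

def is_weighted_policy_value(contexts, actions, rewards, propensities, policy):
    """Importance-sampling-weighted estimate of the value of `policy`.

    contexts:     (T, d) array of observed contexts
    actions:      (T,) array of actions taken by the (adaptive) logging policy
    rewards:      (T,) array of observed rewards
    propensities: (T,) array of logging probabilities e_t(a_t | x_t),
                  assumed recorded at collection time
    policy:       callable mapping a context to an action
    """
    chosen = np.array([policy(x) for x in contexts])
    # Keep only rounds where the candidate policy matches the logged action,
    # and reweight each such reward by the inverse of its logging propensity.
    weights = (chosen == actions) / propensities
    return float(np.mean(weights * rewards))

def is_weighted_erm(contexts, actions, rewards, propensities, policy_class):
    """Select the policy in a finite `policy_class` with the highest
    importance-sampling-weighted value (equivalently, the minimizer of the
    IS-weighted risk when the loss is the negative reward)."""
    values = [
        is_weighted_policy_value(contexts, actions, rewards, propensities, pi)
        for pi in policy_class
    ]
    return policy_class[int(np.argmax(values))]
```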
