Statistical Inference with M-Estimators on Bandit Data

Bandit algorithms are increasingly used in real-world sequential decision-making problems, from online advertising to mobile health. As a result, more datasets are being collected with bandit algorithms, and with them an increased desire to use these datasets to answer scientific questions such as: Did one type of ad increase the click-through rate or lead to more purchases? In which contexts is a mobile health intervention effective? However, it has been shown that classical statistical approaches, such as those based on the ordinary least squares estimator, fail to provide reliable confidence intervals when applied to bandit data. Recently, methods have been developed for conducting statistical inference with simple models fit to data collected by multi-armed bandits; however, general methods for conducting statistical inference with more complex models are lacking. In this work, we develop theory justifying the use of M-estimation (Van der Vaart, 2000), traditionally applied to i.i.d. data, to provide inferential methods for a large class of estimators, including least squares and maximum likelihood estimators, now applied to data collected with (contextual) bandit algorithms. To do this, we generalize the adaptive weights pioneered by Hadad et al. (2019) and Deshpande et al. (2018). Specifically, in settings in which the data are collected via a (contextual) bandit algorithm, we prove that certain adaptively weighted M-estimators are uniformly asymptotically normal, and we demonstrate empirically that their asymptotic distribution can be used to construct reliable confidence regions for a variety of inferential targets.
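
To make the construction concrete, the following is a minimal sketch of an adaptively weighted M-estimator. The square-root importance-weight form shown here follows the variance-stabilizing weights of Hadad et al. (2019); the stabilizing policy \tilde{\pi} and the criterion function m_\theta are notation introduced for illustration and need not match the exact construction analyzed in this work.

    % Sketch (in LaTeX) of an adaptively weighted M-estimator; illustrative form only.
    % m_\theta(x, a, y): criterion function, e.g. a negative squared error or a log-likelihood
    % \pi_t(a | x): action probabilities used by the bandit algorithm at time t (known by design)
    % \tilde{\pi}(a | x): a pre-specified stabilizing policy (assumed notation)
    \hat{\theta}_T \in \operatorname*{arg\,max}_{\theta \in \Theta}
        \sum_{t=1}^{T} W_t \, m_\theta(X_t, A_t, Y_t),
    \qquad
    W_t = \sqrt{\frac{\tilde{\pi}(A_t \mid X_t)}{\pi_t(A_t \mid X_t)}}.

Setting W_t \equiv 1 recovers the classical unweighted M-estimator (ordinary least squares corresponds to m_\theta(x, a, y) = -(y - \phi(x, a)^\top \theta)^2). The adaptive weights counteract the dependence the bandit algorithm induces between the action probabilities and past outcomes, which is what allows a uniform central limit theorem to hold.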

[1] I. A. Ibragimov et al., "Asymptotic Normality for Sums of Dependent Random Variables," 2005.

[2] Christopher J. Miller et al., "Adoption of Mobile Apps for Depression and Anxiety: Cross-Sectional Survey Study on Patient Interest and Barriers to Engagement," JMIR Mental Health, 2019.

[3] Kelly W. Zhang et al., "Inference for Batched Bandits," NeurIPS, 2020.

[4] Stefan Wager et al., "Confidence Intervals for Policy Evaluation in Adaptive Experiments," Proceedings of the National Academy of Sciences, 2021.

[5] Philip S. Thomas et al., "Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning," ICML, 2016.

[6] Zoran Popovic et al., "Trading Off Scientific Knowledge and User Learning with Multi-Armed Bandits," EDM, 2014.

[7] Maximilian Kasy et al., "Adaptive Treatment Assignment in Experiments for Policy Choice," Econometrica, 2019.

[8] Stephen M. Schueller et al., "State of the Field of Mental Health Apps," Cognitive and Behavioral Practice, 2018.

[9] D. Green et al., "Adaptive Experimental Design: Prospects and Applications in Political Science," American Journal of Political Science, 2019.

[10] Joseph Jay Williams et al., "Balancing Student Success and Inferring Personalized Effects in Dynamic Experiments," EDM, 2019.

[11] A. Agresti, Foundations of Linear and Generalized Linear Models, 2015.

[12] Robert Yu et al., "Mobile Health Interventions for Self-Control of Unhealthy Alcohol Use: Systematic Review," JMIR mHealth and uHealth, 2019.

[13] Frederick R. Forst et al., "On Robust Estimation of the Location Parameter," 1980.

[14] "The Effectiveness of Smartphone Applications to Aid Smoking Cessation: A Meta-Analysis," 2020.

[15] P. Klasnja et al., "A Quality-Improvement Optimization Pilot of BariFit, a Mobile Health Intervention to Promote Physical Activity After Bariatric Surgery," Translational Behavioral Medicine, 2020.

[16] Nikolai Matni et al., "Regret Bounds for Robust Adaptive Control of the Linear Quadratic Regulator," NeurIPS, 2018.

[17] Joseph Jay Williams et al., "Statistical Consequences of Using Multi-Armed Bandits to Conduct Adaptive Educational Experiments," 2019.

[18] Vasilis Syrgkanis et al., "Accurate Inference for Adaptive Linear Models," ICML, 2017.

[19] Jon D. McAuliffe et al., "Uniform, Nonparametric, Non-Asymptotic Confidence Sequences," 2018.

[20] Maximilian Kasy et al., "An Adaptive Targeted Field Experiment: Job Search Assistance for Refugees in Jordan," SSRN Electronic Journal, 2020.

[21] Rémi Bardenet et al., "Monte Carlo Methods," Encyclopedia of Social Network Analysis and Mining (2nd ed.), 2013.

[22] Maximilian Kasy et al., "Uniformity and the Delta Method," Journal of Econometric Methods, 2015.

[23] Rémi Munos et al., "Pure Exploration in Multi-armed Bandits Problems," ALT, 2009.

[24] Tianchen Qian et al., "Notifications to Improve Engagement With an Alcohol Reduction App: Protocol for a Micro-Randomized Trial," JMIR Research Protocols, 2020.

[25] Alessandro Lazaric et al., "Trading off Rewards and Errors in Multi-Armed Bandits," AISTATS, 2017.

[26] C. Assaid et al., "The Theory of Response-Adaptive Randomization in Clinical Trials," 2007.

[27] L. Forzani et al., "Asymptotic Theory for Maximum Likelihood Estimates in Reduced-Rank Multivariate Generalized Linear Models," Statistics, 2017.

[28] Csaba Szepesvári et al., "Improved Algorithms for Linear Stochastic Bandits," NIPS, 2011.

[29] Jack Bowden et al., "Multi-armed Bandit Models for the Optimal Design of Clinical Trials: Benefits and Challenges," Statistical Science, 2015.

[30] Shipra Agrawal et al., "Thompson Sampling for Contextual Bandits with Linear Payoffs," ICML, 2012.

[31] Joseph P. Romano et al., "On the Uniform Asymptotic Validity of Subsampling and the Bootstrap," arXiv:1204.2762, 2012.

[32] Nan Jiang et al., "Doubly Robust Off-Policy Value Evaluation for Reinforcement Learning," ICML, 2015.

[33] T. Lai et al., "Least Squares Estimates in Stochastic Regression Models with Applications to Identification and Control of Dynamic Systems," 1982.

[34] Feifang Hu et al., "Efficient Randomized-Adaptive Designs," arXiv:0908.3435, 2009.