Statistical Inference with M-Estimators on Adaptively Collected Data

Bandit algorithms are increasingly used in real-world sequential decision-making problems, and with this comes a growing desire to use the resulting datasets to answer scientific questions such as: Did one type of ad lead to more purchases? In which contexts is a mobile health intervention effective? However, classical statistical approaches fail to provide valid confidence intervals when applied to data collected with bandit algorithms. Alternative methods have recently been developed for simple models (e.g., comparison of means), yet there is a lack of general methods for conducting statistical inference with more complex models on data collected with (contextual) bandit algorithms; for example, current methods cannot be used for valid inference on the parameters of a logistic regression model for a binary reward. In this work, we develop theory justifying the use of M-estimators (which include estimators based on empirical risk minimization as well as maximum likelihood) on data collected with adaptive algorithms, including (contextual) bandit algorithms. Specifically, we show that M-estimators, modified with particular adaptive weights, can be used to construct asymptotically valid confidence regions for a variety of inferential targets.
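
The following is a minimal sketch, not the paper's exact procedure, of how adaptively weighted M-estimation might look on bandit data. The simulation setup (a two-armed Bernoulli bandit, an epsilon-greedy-style adaptive policy, a uniform stabilizing policy, square-root importance weights, and sandwich-style standard errors) consists of illustrative assumptions rather than details taken from the paper.

```python
# A minimal sketch of adaptively weighted M-estimation on bandit data.
# Illustrative assumptions: two-armed Bernoulli bandit, epsilon-greedy-style
# adaptive policy, uniform stabilizing policy pi_stable = 0.5, and adaptive
# weights W_t = sqrt(pi_stable / pi_t). The M-estimator is weighted least
# squares for the two arm means.
import numpy as np

rng = np.random.default_rng(0)
T = 5000
true_means = np.array([0.4, 0.6])
pi_stable = 0.5          # stabilizing policy: choose each arm w.p. 1/2
eps = 0.1                # exploration floor for the adaptive policy

counts = np.zeros(2)
sums = np.zeros(2)
actions, rewards, probs = [], [], []

for t in range(T):
    # Adaptive policy: play the empirically better arm with prob 1 - eps/2,
    # the other arm with prob eps/2 (uniform before both arms have data).
    if counts.min() == 0:
        p1 = 0.5
    else:
        p1 = 1 - eps / 2 if sums[1] / counts[1] >= sums[0] / counts[0] else eps / 2
    a = int(rng.random() < p1)
    pi_t = p1 if a == 1 else 1 - p1
    r = float(rng.random() < true_means[a])
    counts[a] += 1
    sums[a] += r
    actions.append(a); rewards.append(r); probs.append(pi_t)

actions = np.array(actions); rewards = np.array(rewards); probs = np.array(probs)

# Square-root importance weights relative to the stabilizing policy.
W = np.sqrt(pi_stable / probs)

# Weighted least-squares M-estimator of the arm means:
# minimize sum_t W_t * (R_t - theta_{A_t})^2  =>  per-arm weighted averages.
theta_hat = np.array([
    np.sum(W[actions == a] * rewards[actions == a]) / np.sum(W[actions == a])
    for a in (0, 1)
])

# Sandwich-style variance estimate for each weighted mean (an illustration,
# not the paper's variance formula).
se = np.array([
    np.sqrt(np.sum((W[actions == a] * (rewards[actions == a] - theta_hat[a])) ** 2))
    / np.sum(W[actions == a])
    for a in (0, 1)
])

for a in (0, 1):
    lo, hi = theta_hat[a] - 1.96 * se[a], theta_hat[a] + 1.96 * se[a]
    print(f"arm {a}: estimate {theta_hat[a]:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

In this sketch the weights downweight observations gathered when the adaptive policy concentrated heavily on one arm, which conveys the intuition for why adaptive weighting can restore asymptotically valid confidence regions under adaptive sampling.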
