Estimation Considerations in Contextual Bandits

Contextual bandit algorithms seek to learn a personalized treatment assignment policy, balancing exploration against exploitation. Although a number of algorithms have been proposed, little guidance is available to help applied researchers select among them. Motivated by the econometrics and statistics literatures on causal effect estimation, we study a new consideration in the exploration-versus-exploitation framework: the way exploration is conducted in the present can introduce bias and variance into the estimation of the potential outcome models used in subsequent stages of learning. We leverage parametric and non-parametric statistical estimation methods, as well as causal effect estimation methods, to propose new contextual bandit designs. Through a variety of simulations, we show how alternative design choices affect learning performance and provide insights into why these effects arise.
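To make the setting concrete, the following is a minimal sketch of the kind of parametric contextual bandit design the abstract refers to: a Thompson-sampling loop with a Bayesian ridge-regression reward model per arm, run on simulated data. It is an illustration under assumed names and parameters (e.g., the simulation setup, noise variance, and ridge penalty are hypothetical), not the paper's implementation.

```python
# Hedged sketch: contextual Thompson sampling with a Gaussian (ridge) reward
# model per arm. All simulation details here are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_arms, d, horizon, sigma2, lam = 3, 5, 2000, 0.25, 1.0

# True arm parameters used only to simulate rewards (unknown to the learner).
theta_true = rng.normal(size=(n_arms, d))

# Per-arm posterior sufficient statistics: A = lam*I + X'X, b = X'y.
A = np.stack([lam * np.eye(d) for _ in range(n_arms)])
b = np.zeros((n_arms, d))

regret = 0.0
for t in range(horizon):
    x = rng.normal(size=d)  # observed context for this round

    # Thompson sampling: draw a parameter vector from each arm's posterior
    # and play the arm whose sampled model predicts the highest reward.
    sampled_means = np.empty(n_arms)
    for a in range(n_arms):
        A_inv = np.linalg.inv(A[a])
        mean = A_inv @ b[a]
        draw = rng.multivariate_normal(mean, sigma2 * A_inv)
        sampled_means[a] = x @ draw
    arm = int(np.argmax(sampled_means))

    # Observe a noisy reward and update the chosen arm's statistics.
    reward = x @ theta_true[arm] + rng.normal(scale=np.sqrt(sigma2))
    A[arm] += np.outer(x, x)
    b[arm] += reward * x
    regret += np.max(theta_true @ x) - x @ theta_true[arm]

print(f"cumulative regret after {horizon} rounds: {regret:.1f}")
```

The exploration induced by the posterior draws determines which contexts end up paired with which arms; that adaptively collected data is then what the per-arm outcome models are estimated from, which is exactly the bias-and-variance feedback the abstract highlights.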
