Sublinear Optimal Policy Value Estimation in Contextual Bandits

We study the problem of estimating the expected reward of the optimal policy in the stochastic disjoint linear bandit setting. We prove that in certain settings it is possible to obtain an accurate estimate of the optimal policy value using a number of samples that is sublinear in the number that would be required to \emph{find} a policy that achieves a value close to this optimum. We establish nearly matching information-theoretic lower bounds, showing that our algorithm attains near-optimal estimation error. Finally, we demonstrate the effectiveness of our algorithm on joke recommendation and cancer inhibition dosage selection problems using real datasets.
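
To fix ideas, here is a minimal sketch of the quantity being estimated, assuming the standard disjoint linear formulation (the symbols $K$, $d$, $\theta_a$, and $\mathcal{D}$ below are illustrative, not taken from the paper): each arm $a \in \{1, \dots, K\}$ carries an unknown parameter $\theta_a \in \mathbb{R}^d$, a context $x$ is drawn from a distribution $\mathcal{D}$, and pulling arm $a$ in context $x$ yields expected reward $\langle x, \theta_a \rangle$. The optimal policy value is then
$$ V^* \;=\; \mathbb{E}_{x \sim \mathcal{D}}\Big[\, \max_{a \in \{1,\dots,K\}} \langle x, \theta_a \rangle \,\Big], $$
and the estimation task is to output $\hat{V}$ with $|\hat{V} - V^*| \le \epsilon$, potentially using far fewer samples than are needed to identify a policy whose own value is within $\epsilon$ of $V^*$.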
