Beyond UCB: Optimal and Efficient Contextual Bandits with Regression Oracles

A fundamental challenge in contextual bandits is to develop flexible, general-purpose algorithms whose computational requirements are no worse than those of classical supervised learning tasks such as classification and regression. Algorithms based on regression have shown promising empirical success, but theoretical guarantees have remained elusive except in special cases. We provide the first universal and optimal reduction from contextual bandits to online regression. We show how to transform any oracle for online regression with a given value function class into an algorithm for contextual bandits with the induced policy class, with no overhead in runtime or memory requirements. We characterize the minimax rates for contextual bandits with general, potentially nonparametric function classes, and show that our algorithm is minimax optimal whenever the oracle obtains the optimal rate for regression. Compared to previous results, our algorithm requires no distributional assumptions beyond realizability, and works even when contexts are chosen adversarially.
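The abstract states the reduction only at a high level. As a hedged illustration, the Python sketch below shows how an online regression oracle can drive action selection via inverse-gap weighting, the general scheme underlying this line of work. The oracle interface (predict/update), the linear least-squares oracle, and the fixed exploration parameter gamma are illustrative assumptions rather than the paper's implementation; in analyses of this kind, gamma would be tuned as a function of the horizon and the oracle's regret.

```python
import numpy as np

class OnlineRegressionOracle:
    """Illustrative online regression oracle (an assumption for this sketch):
    one linear predictor per action, updated by online gradient descent on
    the squared loss. Any online regression algorithm exposing this
    predict/update interface could be plugged in instead."""

    def __init__(self, num_actions, dim, lr=0.1):
        self.w = np.zeros((num_actions, dim))
        self.lr = lr

    def predict(self, x):
        # Predicted reward for every action given context x.
        return self.w @ x

    def update(self, x, action, reward):
        # Gradient step on the squared loss, for the chosen action only
        # (bandit feedback: the other actions' rewards are unobserved).
        err = self.w[action] @ x - reward
        self.w[action] -= self.lr * err * x

def inverse_gap_weighting(pred, gamma):
    """Map oracle predictions to an action distribution: each action is
    downweighted in proportion to its predicted gap from the leader."""
    K = len(pred)
    best = int(np.argmax(pred))
    p = np.zeros(K)
    for a in range(K):
        if a != best:
            p[a] = 1.0 / (K + gamma * (pred[best] - pred[a]))
    p[best] = 1.0 - p.sum()  # leftover mass goes to the empirical best action
    return p

def contextual_bandit_loop(oracle, contexts, reward_fn, gamma=50.0, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for x in contexts:
        p = inverse_gap_weighting(oracle.predict(x), gamma)
        a = int(rng.choice(len(p), p=p))  # randomized exploration
        r = reward_fn(x, a)               # observe reward for chosen action only
        oracle.update(x, a, r)            # exactly one oracle call per round
        total += r
    return total

# Toy usage: noisy linear rewards, so the linear oracle is well specified
# (purely illustrative).
d, K, T = 5, 4, 2000
rng = np.random.default_rng(1)
w_star = rng.normal(size=(K, d))
contexts = rng.normal(size=(T, d))
reward_fn = lambda x, a: float(w_star[a] @ x) + 0.1 * rng.normal()
oracle = OnlineRegressionOracle(K, d)
print(contextual_bandit_loop(oracle, contexts, reward_fn))
```

Note that the oracle is invoked exactly once per round for prediction and once for an update, which reflects the sense in which the reduction adds no runtime or memory overhead beyond the regression algorithm itself.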
