Stochastic Contextual Bandits with Long Horizon Rewards

The growing interest in complex decision-making and language modeling problems highlights the importance of sample-efficient learning over very long horizons. This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most $s$ prior actions and contexts (not necessarily consecutive), up to a time horizon of $h$. To avoid polynomial dependence on $h$, we propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly. We consider both the data-poor ($T<h$) and data-rich ($T\ge h$) regimes, and derive respective regret upper bounds $\tilde O(d\sqrt{sT} +\min\{ q, T\})$ and $\tilde O(\sqrt{sdT})$, with sparsity $s$, feature dimension $d$, total time horizon $T$, and $q$ adaptive to the reward dependence pattern. Complementing the upper bounds, we also show that learning over a single trajectory brings inherent challenges: while the dependence pattern and arm parameters form a rank-1 matrix, circulant matrices are not isometric over rank-1 manifolds, and the sample complexity indeed benefits from the sparse reward dependence structure. Our results necessitate a new analysis to address long-range temporal dependencies across data and avoid polynomial dependence on the reward horizon $h$. Specifically, we utilize connections to the restricted isometry property of circulant matrices formed by dependent sub-Gaussian vectors and establish new guarantees that are also of independent interest.
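As a concrete illustration of the reward model above, the following minimal Python sketch simulates one trajectory. It is an assumed formalization for illustration only: an unknown arm parameter $\theta$, an unknown $s$-sparse binary dependence pattern over a window of $h$ lags, and their rank-1 outer product as the per-lag parameter matrix. The number of arms, the uniform placeholder policy, the noise level, and the exact lag indexing are illustrative choices, not the paper's algorithm or protocol.

```python
# Minimal sketch of the long-horizon reward model (illustrative assumptions:
# constants, uniform policy, Gaussian contexts and noise, lag indexing).
import numpy as np

rng = np.random.default_rng(0)

d, h, s, K, T = 5, 50, 3, 10, 200   # feature dim, reward horizon, sparsity, arms, rounds

theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)                 # unknown arm parameter
active_lags = rng.choice(h, size=s, replace=False)
w = np.zeros(h)
w[active_lags] = 1.0                           # unknown s-sparse dependence pattern
                                               # (lag 0 = current step, lags >= 1 = prior steps)

# Per-lag parameter matrix: the rank-1 outer product mentioned in the abstract.
Theta = np.outer(w, theta)                     # shape (h, d), rank 1

chosen = np.zeros((T, d))                      # features of the chosen arms
rewards = np.zeros(T)

for t in range(T):
    contexts = rng.normal(size=(K, d))         # fresh contexts for K arms
    a_t = rng.integers(K)                      # placeholder policy (uniform)
    chosen[t] = contexts[a_t]

    # Reward depends on at most s of the last h chosen (action, context) pairs,
    # not necessarily consecutive ones.
    r_t = 0.0
    for j in range(h):
        if t - j >= 0 and w[j]:
            r_t += chosen[t - j] @ theta
    rewards[t] = r_t + 0.1 * rng.normal()      # sub-Gaussian observation noise

# Stacking the length-h feature windows row by row yields the (quasi-)circulant
# design built from dependent sub-Gaussian vectors; a learner would regress
# `rewards` on these windows to jointly recover the sparse pattern w and theta.
```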
