Linearly Parameterized Bandits

We consider bandit problems involving a large (possibly infinite) collection of arms, in which the expected reward of each arm is a linear function of an r-dimensional random vector Z ∈ ℝ^r, where r ≥ 2. The objective is to minimize the cumulative regret and Bayes risk. When the set of arms corresponds to the unit sphere, we prove that the regret and Bayes risk are of order Θ(r√T), by establishing a lower bound for an arbitrary policy and showing that a matching upper bound is obtained through a policy that alternates between exploration and exploitation phases. The phase-based policy is also shown to be effective if the set of arms satisfies a strong convexity condition. For the case of a general set of arms, we describe a near-optimal policy whose regret and Bayes risk admit upper bounds of the form O(r√T log^{3/2} T).
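
The phase-based policy described above can be illustrated with a small sketch. The Python snippet below is a hypothetical toy version, not the paper's exact algorithm: it alternates an exploration phase (one pull along each coordinate direction, used to estimate Z by per-coordinate sample means) with a greedy exploitation phase whose length grows with the cycle index, so that only O(r√T) of the first T plays are exploratory. The reward model, noise distribution, and phase-length schedule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
r, horizon = 5, 10_000

# Unknown parameter vector, normalized to the unit sphere (assumption).
Z = rng.standard_normal(r)
Z /= np.linalg.norm(Z)

def pull(u):
    """Noisy linear reward <u, Z> + N(0, 1) for arm u (assumed noise model)."""
    return float(u @ Z) + rng.standard_normal()

z_hat = np.zeros(r)      # running estimate of Z, one sample mean per coordinate
counts = np.zeros(r)
regret = 0.0
t = 0
cycle = 0
while t < horizon:
    cycle += 1
    # Exploration phase: pull each basis vector e_i once; since
    # E[<e_i, Z>] = Z_i, this yields an unbiased estimate of each coordinate.
    for i in range(r):
        if t >= horizon:
            break
        e = np.zeros(r)
        e[i] = 1.0
        counts[i] += 1
        z_hat[i] += (pull(e) - z_hat[i]) / counts[i]   # incremental mean
        regret += 1.0 - Z[i]      # the best arm Z/||Z|| earns expected reward 1
        t += 1
    # Exploitation phase: on the unit sphere, the greedy arm for an estimate
    # z_hat is z_hat / ||z_hat||. The phase length grows with the cycle index,
    # so after C cycles roughly T ~ rC + C^2/2 plays have occurred and only
    # about r*sqrt(2T) of them were exploratory.
    greedy = z_hat / max(np.linalg.norm(z_hat), 1e-12)
    for _ in range(cycle):
        if t >= horizon:
            break
        pull(greedy)
        regret += 1.0 - float(greedy @ Z)
        t += 1

print(f"cumulative regret after {horizon} plays: {regret:.1f}")
```

Under these assumptions, the exploration cost grows as r√T while the exploitation cost shrinks as the estimate of Z concentrates, which is the balance behind the Θ(r√T) rate on the unit sphere.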
