Policies without Memory for the Infinite-Armed Bernoulli Bandit under the Average-Reward Criterion
We consider a bandit problem with infinitely many Bernoulli arms whose unknown parameters are i.i.d. We present two policies that maximize the almost sure average reward over an infinite horizon. Neither policy returns to a previously observed arm after switching to a new one, and neither retains information from discarded arms; a new arm is selected after a run of failures on the current arm. The first policy is nonstationary and requires no information about the distribution of the Bernoulli parameter. The second is stationary and requires only partial information; its optimality is established via renewal theory. We also develop ε-optimal stationary policies that require no information about the distribution of the unknown parameter, and we discuss universally optimal stationary policies.
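To make the "run of failures" switching rule concrete, the following is a minimal simulation sketch, not the authors' policies: it plays a memoryless rule that stays with the current arm until a fixed number of consecutive failures occurs, then draws a fresh arm whose parameter is sampled i.i.d. (Uniform(0,1) is an assumed prior chosen only for illustration, as is the fixed run length).

```python
import random


def simulate(horizon, run_length, seed=0):
    """Play `horizon` rounds of a run-of-failures policy.

    Hypothetical illustration: the arm parameter prior (Uniform(0,1)) and the
    fixed run length are assumptions, not the schedules used in the paper.
    Switch to a fresh arm whenever `run_length` consecutive failures occur;
    never return to a discarded arm and retain no information about it.
    """
    rng = random.Random(seed)
    total_reward = 0
    p = rng.random()              # parameter of the current (fresh) arm
    failures_in_a_row = 0
    for _ in range(horizon):
        reward = 1 if rng.random() < p else 0
        total_reward += reward
        if reward:
            failures_in_a_row = 0            # success: stay with this arm
        else:
            failures_in_a_row += 1
            if failures_in_a_row >= run_length:
                p = rng.random()             # discard the arm, draw a new one
                failures_in_a_row = 0        # no memory of the old arm is kept
    return total_reward / horizon


if __name__ == "__main__":
    # Larger run lengths tolerate more failures before switching, so arms with
    # parameters near 1 are retained longer once found.
    for r in (1, 2, 5, 10):
        print(r, round(simulate(200_000, r), 3))
```

The sketch only illustrates the switching mechanism; the paper's nonstationary policy varies its tolerance over time rather than using a fixed run length.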