Normal Bandits of Unknown Means and Variances: Asymptotic Optimality, Finite Horizon Regret Bounds, and a Solution to an Open Problem

Consider the problem of sampling sequentially from a finite number of $N \geq 2$ populations, specified by random variables $X^i_k$, $i = 1, \ldots, N$ and $k = 1, 2, \ldots$, where $X^i_k$ denotes the outcome from population $i$ the $k$th time it is sampled. It is assumed that for each fixed $i$, $\{X^i_k\}_{k \geq 1}$ is a sequence of i.i.d. normal random variables with unknown mean $\mu_i$ and unknown variance $\sigma^2_i$. The objective is to have a policy $\pi$ for deciding from which of the $N$ populations to sample at any time $n = 1, 2, \ldots$ so as to maximize the expected sum of outcomes of $n$ samples, or equivalently to minimize the regret due to lack of information about the parameters $\mu_i$ and $\sigma^2_i$. In this paper, we present a simple inflated sample mean (ISM) index policy that is asymptotically optimal in the sense of Theorem 4 below. This resolves a standing open problem from Burnetas and Katehakis (1996b). Additionally, finite-horizon regret bounds are given.
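For a policy $\pi$, the regret after $n$ rounds is the standard quantity $R_\pi(n) = n\,\mu^* - \mathbb{E}_\pi\big[\sum_{t=1}^{n} X_{\pi(t)}\big]$, where $\mu^* = \max_i \mu_i$ and $X_{\pi(t)}$ is the outcome of the population sampled at time $t$. The Python sketch below illustrates the general shape of an ISM-type index policy: every population is sampled a few times to initialize its mean and variance estimates, and thereafter the population with the largest inflated sample mean is sampled. The specific inflation term $\hat\sigma_i\sqrt{n^{2/(T_i-2)} - 1}$, the three-sample initialization, and all function names here are illustrative assumptions, not the paper's definitive policy; the exact index and the conditions under which it is asymptotically optimal are given by Theorem 4.

```python
import numpy as np


def ism_index(mean, var_hat, t_i, n):
    """Illustrative inflated-sample-mean (ISM) index for a single arm.

    mean    : sample mean of the arm's rewards so far
    var_hat : sample-variance estimate for the arm
    t_i     : number of times the arm has been sampled (assumed >= 3)
    n       : current round

    The inflation term sqrt(var_hat * (n**(2/(t_i - 2)) - 1)) is an
    illustrative assumption modeled on the ISM index discussed in the
    paper; see Theorem 4 there for the exact policy.
    """
    return mean + np.sqrt(var_hat * (n ** (2.0 / (t_i - 2)) - 1.0))


def ism_policy(arms, horizon, rng=None):
    """Run the sketch ISM policy on a list of (mu, sigma) normal arms."""
    rng = rng if rng is not None else np.random.default_rng(0)
    rewards = [[] for _ in arms]

    def pull(i):
        mu, sigma = arms[i]
        rewards[i].append(rng.normal(mu, sigma))

    # Initialization: sample every arm three times so that the variance
    # estimate and the exponent 2 / (T_i - 2) are well defined.
    for i in range(len(arms)):
        for _ in range(3):
            pull(i)

    # Main loop: at each round pull the arm with the largest ISM index.
    for n in range(3 * len(arms) + 1, horizon + 1):
        indices = [ism_index(np.mean(r), np.var(r), len(r), n) for r in rewards]
        pull(int(np.argmax(indices)))

    return sum(map(sum, rewards)), [len(r) for r in rewards]


if __name__ == "__main__":
    # Two hypothetical normal arms; the first (mean 1.0) is optimal.
    total, counts = ism_policy([(1.0, 2.0), (0.5, 1.0)], horizon=5000)
    print("total reward:", total, "pull counts:", counts)
```

In a run of this sketch, the suboptimal arm's pull count should grow only slowly with the horizon, which is the qualitative behavior (logarithmic regret growth) that the finite-horizon bounds of the paper quantify.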

[1]  Akimichi Takemura,et al.  An Asymptotically Optimal Bandit Algorithm for Bounded Support Models , 2010, COLT 2010.

[2]  Michael N. Katehakis,et al.  An Asymptotically Optimal UCB Policy for Uniform Bandits of Unknown Support , 2015, ArXiv.

[3]  A. Burnetas,et al.  Optimal Adaptive Policies for Sequential Allocation Problems , 1996 .

[4]  Emilie Kaufmann,et al.  Analysis of Bayesian and Frequentist Strategies for Sequential Resource Allocation , 2014 .

[5]  H. Robbins,  Some aspects of the sequential design of experiments , 1952 .

[6]  Michael Z. Zgurovsky,et al.  Convergence of value iterations for total-cost MDPs and POMDPs with general state and action sets , 2014, 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL).

[7]  M. Katehakis,et al.  Simple Policies with (a.s.) Arbitrarily Slow Growing Regret for Sequential Allocation Problems , 2015 .

[8]  Lihong Li,et al.  On Minimax Optimal Offline Policy Evaluation , 2014, ArXiv.

[9]  Aurélien Garivier,et al.  Optimism in Reinforcement Learning Based on Kullback-Leibler Divergence , 2010, ArXiv.

[10]  Benjamin Van Roy,et al.  Near-optimal Reinforcement Learning in Factored MDPs , 2014, NIPS.

[11]  Warren B. Powell,et al.  Asymptotically optimal Bayesian sequential change detection and identification rules , 2013, Ann. Oper. Res..

[12]  Peter Auer,et al.  Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[13]  A. Burnetas,et al.  Asymptotic Bayes Analysis for the Finite-Horizon One-Armed-Bandit Problem , 2003, Probability in the Engineering and Informational Sciences.

[14]  R. Munos,et al.  Kullback–Leibler upper confidence bounds for optimal sequential allocation , 2012, 1210.1136.

[15]  H. Robbins,et al.  Sequential choice from several populations , 1995, Proceedings of the National Academy of Sciences of the United States of America.

[16]  Michael N. Katehakis,et al.  The Multi-Armed Bandit Problem: Decomposition and Computation , 1987, Math. Oper. Res..

[17]  Akimichi Takemura,et al.  Optimality of Thompson Sampling for Gaussian Bandits Depends on Priors , 2013, AISTATS.

[18]  Ambuj Tewari,et al.  Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs , 2007, NIPS.

[19]  M. Katehakis,et al.  Multi-Armed Bandits Under General Depreciation and Commitment , 2014, Probability in the Engineering and Informational Sciences.

[20]  Ambuj Tewari,et al.  REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs , 2009, UAI.

[21]  Aleksandrs Slivkins,et al.  The Best of Both Worlds: Stochastic and Adversarial Bandits , 2012, 25th Annual Conference on Learning Theory (COLT).

[22]  R. Weber,  On the Gittins Index for Multiarmed Bandits , 1992 .

[23]  Apostolos Burnetas,et al.  On Sequencing Two Types of Tasks on a Single Processor under Incomplete Information , 1993, Probability in the Engineering and Informational Sciences.

[24]  Akimichi Takemura,et al.  An asymptotically optimal policy for finite support models in the multiarmed bandit problem , 2009, Machine Learning.

[25]  Michael N. Katehakis,et al.  Asymptotic Behavior of Minimal-Exploration Allocation Policies: Almost Sure, Arbitrarily Slow Growing Regret , 2015, ArXiv.

[26]  Michael L. Littman,et al.  Inducing Partially Observable Markov Decision Processes , 2012, ICGI.

[27]  Apostolos Burnetas,et al.  Optimal Adaptive Policies for Markov Decision Processes , 1997, Math. Oper. Res..

[28]  A. Burnetas,et al.  Dynamic allocation policies for the finite horizon one armed bandit problem , 1998 .

[29]  Panos M. Pardalos,et al.  Cooperative Control: Models, Applications, and Algorithms , 2003 .

[30]  J. Gittins,  Bandit processes and dynamic allocation indices , 1979 .

[31]  Mingyan Liu,et al.  Approximately optimal adaptive learning in opportunistic spectrum access , 2012, 2012 Proceedings IEEE INFOCOM.

[32]  Michael N. Katehakis,et al.  Computing Optimal Sequential Allocation Rules in Clinical Trials , 1986 .

[33]  Peter Auer,et al.  UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem , 2010, Period. Math. Hung..

[34]  T. L. Lai and Herbert Robbins,  Asymptotically Efficient Adaptive Allocation Rules , 1985 .

[35]  Csaba Szepesvári,et al.  Exploration-exploitation tradeoff using variance estimates in multi-armed bandits , 2009, Theor. Comput. Sci..

[36]  Uriel G. Rothblum,et al.  The multi-armed bandit, with constraints , 2012, Annals of Operations Research.

[37]  J. Bather,et al.  Multi‐Armed Bandit Allocation Indices , 1990 .

[38]  Wassim Jouini,et al.  Multi-armed bandit based policies for cognitive radio's decision making issues , 2009, 2009 3rd International Conference on Signals, Circuits and Systems (SCS).

[39]  Michail G. Lagoudakis,et al.  Least-Squares Policy Iteration , 2003, J. Mach. Learn. Res..

[40]  Apostolos Burnetas,et al.  On large deviations properties of sequential allocation problems , 1996 .