Normal Bandits of Unknown Means and Variances: Asymptotic Optimality, Finite Horizon Regret Bounds, and a Solution to an Open Problem

Consider the problem of sampling sequentially from a finite number of $N \geq 2$ populations, specified by random variables $X^i_k$, $i = 1, \ldots, N$ and $k = 1, 2, \ldots$, where $X^i_k$ denotes the outcome from population $i$ the $k$th time it is sampled. It is assumed that for each fixed $i$, $\{X^i_k\}_{k \geq 1}$ is a sequence of i.i.d. normal random variables with unknown mean $\mu_i$ and unknown variance $\sigma^2_i$. The objective is to have a policy $\pi$ for deciding from which of the $N$ populations to sample at any time $n = 1, 2, \ldots$ so as to maximize the expected sum of outcomes of $n$ samples, or equivalently to minimize the regret due to lack of information about the parameters $\mu_i$ and $\sigma^2_i$. In this paper, we present a simple inflated sample mean (ISM) index policy that is asymptotically optimal in the sense of Theorem 4 below. This resolves a standing open problem from Burnetas and Katehakis (1996b). Additionally, finite-horizon regret bounds are given.
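For a policy $\pi$, the regret after $n$ rounds is the standard quantity $R_\pi(n) = n\,\mu^* - \mathbb{E}_\pi\big[\sum_{t=1}^{n} X_{\pi(t)}\big]$, where $\mu^* = \max_i \mu_i$ and $X_{\pi(t)}$ is the outcome of the population sampled at time $t$. The Python sketch below illustrates the general shape of an ISM-type index policy: every population is sampled a few times to initialize its mean and variance estimates, and thereafter the population with the largest inflated sample mean is sampled. The specific inflation term $\hat\sigma_i\sqrt{n^{2/(T_i-2)} - 1}$, the three-sample initialization, and all function names here are illustrative assumptions, not the paper's definitive policy; the exact index and the conditions under which it is asymptotically optimal are given by Theorem 4.

```python
import numpy as np


def ism_index(mean, var_hat, t_i, n):
    """Illustrative inflated-sample-mean (ISM) index for a single arm.

    mean    : sample mean of the arm's rewards so far
    var_hat : sample-variance estimate for the arm
    t_i     : number of times the arm has been sampled (assumed >= 3)
    n       : current round

    The inflation term sqrt(var_hat * (n**(2/(t_i - 2)) - 1)) is an
    illustrative assumption modeled on the ISM index discussed in the
    paper; see Theorem 4 there for the exact policy.
    """
    return mean + np.sqrt(var_hat * (n ** (2.0 / (t_i - 2)) - 1.0))


def ism_policy(arms, horizon, rng=None):
    """Run the sketch ISM policy on a list of (mu, sigma) normal arms."""
    rng = rng if rng is not None else np.random.default_rng(0)
    rewards = [[] for _ in arms]

    def pull(i):
        mu, sigma = arms[i]
        rewards[i].append(rng.normal(mu, sigma))

    # Initialization: sample every arm three times so that the variance
    # estimate and the exponent 2 / (T_i - 2) are well defined.
    for i in range(len(arms)):
        for _ in range(3):
            pull(i)

    # Main loop: at each round pull the arm with the largest ISM index.
    for n in range(3 * len(arms) + 1, horizon + 1):
        indices = [ism_index(np.mean(r), np.var(r), len(r), n) for r in rewards]
        pull(int(np.argmax(indices)))

    return sum(map(sum, rewards)), [len(r) for r in rewards]


if __name__ == "__main__":
    # Two hypothetical normal arms; the first (mean 1.0) is optimal.
    total, counts = ism_policy([(1.0, 2.0), (0.5, 1.0)], horizon=5000)
    print("total reward:", total, "pull counts:", counts)
```

In a run of this sketch, the suboptimal arm's pull count should grow only slowly with the horizon, which is the qualitative behavior (logarithmic regret growth) that the finite-horizon bounds of the paper quantify.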

[1]  Akimichi Takemura,et al.  An Asymptotically Optimal Bandit Algorithm for Bounded Support Models , 2010, COLT 2010.

[2]  Michael N. Katehakis,et al.  An Asymptotically Optimal UCB Policy for Uniform Bandits of Unknown Support , 2015, ArXiv.

[3]  A. Burnetas,et al.  Optimal Adaptive Policies for Sequential Allocation Problems , 1996 .

[4]  Emilie Kaufmann,et al.  Analysis of Bayesian and Frequentist Strategies for Sequential Resource Allocation , 2014 .

[5]  H. Robbins,  Some aspects of the sequential design of experiments , 1952 .

[6]  Michael Z. Zgurovsky,et al.  Convergence of value iterations for total-cost MDPs and POMDPs with general state and action sets , 2014, 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL).

[7]  M. Katehakis,et al.  Simple Policies with (a.s.) Arbitrarily Slow Growing Regret for Sequential Allocation Problems , 2015 .

[8]  Lihong Li,et al.  On Minimax Optimal Offline Policy Evaluation , 2014, ArXiv.

[9]  Aurélien Garivier,et al.  Optimism in Reinforcement Learning Based on Kullback-Leibler Divergence , 2010, ArXiv.

[10]  Benjamin Van Roy,et al.  Near-optimal Reinforcement Learning in Factored MDPs , 2014, NIPS.

[11]  Warren B. Powell,et al.  Asymptotically optimal Bayesian sequential change detection and identification rules , 2013, Ann. Oper. Res..

[12]  Peter Auer,et al.  Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.

[13]  A. Burnetas,et al.  Asymptotic Bayes Analysis for the Finite-Horizon One-Armed-Bandit Problem , 2003, Probability in the Engineering and Informational Sciences.

[14]  R. Munos,et al.  Kullback–Leibler upper confidence bounds for optimal sequential allocation , 2012, 1210.1136.

[15]  H. Robbins,et al.  Sequential choice from several populations , 1995, Proceedings of the National Academy of Sciences of the United States of America.

[16]  Michael N. Katehakis,et al.  The Multi-Armed Bandit Problem: Decomposition and Computation , 1987, Math. Oper. Res..

[17]  Akimichi Takemura,et al.  Optimality of Thompson Sampling for Gaussian Bandits Depends on Priors , 2013, AISTATS.

[18]  Ambuj Tewari,et al.  Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs , 2007, NIPS.

[19]  M. Katehakis,et al.  Multi-Armed Bandits Under General Depreciation and Commitment , 2014, Probability in the Engineering and Informational Sciences.

[20]  Ambuj Tewari,et al.  REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs , 2009, UAI.

[21]  Aleksandrs Slivkins,et al.  The Best of Both Worlds: Stochastic and Adversarial Bandits , 2012, 25th Annual Conference on Learning Theory (COLT).

[22]  R. Weber,  On the Gittins Index for Multiarmed Bandits , 1992 .

[23]  Apostolos Burnetas,et al.  On Sequencing Two Types of Tasks on a Single Processor under Incomplete Information , 1993, Probability in the Engineering and Informational Sciences.

[24]  Akimichi Takemura,et al.  An asymptotically optimal policy for finite support models in the multiarmed bandit problem , 2009, Machine Learning.

[25]  Michael N. Katehakis,et al.  Asymptotic Behavior of Minimal-Exploration Allocation Policies: Almost Sure, Arbitrarily Slow Growing Regret , 2015, ArXiv.

[26]  Michael L. Littman,et al.  Inducing Partially Observable Markov Decision Processes , 2012, ICGI.

[27]  Apostolos Burnetas,et al.  Optimal Adaptive Policies for Markov Decision Processes , 1997, Math. Oper. Res..

[28]  A. Burnetas,et al.  Dynamic allocation policies for the finite horizon one armed bandit problem , 1998 .

[29]  Panos M. Pardalos,et al.  Cooperative Control: Models, Applications, and Algorithms , 2003 .

[30]  J. Gittins,  Bandit processes and dynamic allocation indices , 1979 .

[31]  Mingyan Liu,et al.  Approximately optimal adaptive learning in opportunistic spectrum access , 2012, 2012 Proceedings IEEE INFOCOM.

[32]  Michael N. Katehakis,et al.  Computing Optimal Sequential Allocation Rules in Clinical Trials , 1986 .

[33]  Peter Auer,et al.  UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem , 2010, Period. Math. Hung..

[34]  T. L. Lai and Herbert Robbins,  Asymptotically Efficient Adaptive Allocation Rules , 1985 .

[35]  Csaba Szepesvári,et al.  Exploration-exploitation tradeoff using variance estimates in multi-armed bandits , 2009, Theor. Comput. Sci..

[36]  Uriel G. Rothblum,et al.  The multi-armed bandit, with constraints , 2012, Annals of Operations Research.

[37]  J. Bather,et al.  Multi‐Armed Bandit Allocation Indices , 1990 .

[38]  Wassim Jouini,et al.  Multi-armed bandit based policies for cognitive radio's decision making issues , 2009, 2009 3rd International Conference on Signals, Circuits and Systems (SCS).

[39]  Michail G. Lagoudakis,et al.  Least-Squares Policy Iteration , 2003, J. Mach. Learn. Res..

[40]  Apostolos Burnetas,et al.  On large deviations properties of sequential allocation problems , 1996 .