Rapidly Finding the Best Arm Using Variance

We address the problem of identifying the best arm in a pure-exploration multi-armed bandit problem. In this setting, the agent repeatedly pulls arms in order to identify the one with the maximum expected reward. We focus on the fixed-budget version of the problem, in which the agent must identify the best arm within a fixed number of arm pulls. We propose a novel sequential elimination method that exploits the empirical variance of the arms. We detail and analyse the overall approach, providing both theoretical guarantees and empirical results. The experimental evaluation shows the advantage of our variance-based rejection method in heterogeneous test settings, in terms of both identification accuracy and execution time.
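To make the setting concrete, the following is a minimal sketch of a variance-based sequential elimination scheme for fixed-budget best-arm identification. It is illustrative only, not the paper's exact algorithm: the budget is split into K-1 phases, each phase pulls every surviving arm equally, and after each phase the arm with the lowest empirical-Bernstein-style index (empirical mean plus a variance-dependent confidence width) is rejected. The function `pull(i)` and the confidence-width constants are assumptions for the sketch.

```python
import math
import random

def variance_based_elimination(pull, n_arms, budget):
    """Fixed-budget best-arm identification by sequential elimination.

    Illustrative sketch (the paper's exact rejection rule may differ):
    split the budget over n_arms - 1 phases; in each phase pull every
    surviving arm equally, then reject the arm whose variance-aware
    index is lowest. `pull(i)` returns a stochastic reward for arm i.
    """
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    sumsq = [0.0] * n_arms
    alive = set(range(n_arms))
    phases = n_arms - 1
    for _ in range(phases):
        # equal allocation among surviving arms within this phase
        per_arm = max(1, budget // (phases * len(alive)))
        for i in alive:
            for _ in range(per_arm):
                r = pull(i)
                counts[i] += 1
                sums[i] += r
                sumsq[i] += r * r

        def index(i):
            n = counts[i]
            mean = sums[i] / n
            var = max(0.0, sumsq[i] / n - mean * mean)
            # empirical-Bernstein-style width: low-variance arms get a
            # tighter bonus, so they are assessed (and rejected) faster
            return (mean
                    + math.sqrt(2.0 * var * math.log(budget) / n)
                    + 3.0 * math.log(budget) / n)

        alive.remove(min(alive, key=index))
    return alive.pop()

# usage on a toy Gaussian bandit with a clearly best arm (index 2)
rng = random.Random(42)
means = [0.2, 0.5, 0.9, 0.4]
best = variance_based_elimination(lambda i: rng.gauss(means[i], 0.1),
                                  n_arms=4, budget=4000)
```

The variance term is what distinguishes this from plain Successive Rejects: arms whose rewards fluctuate little need fewer samples before they can be confidently discarded, freeing budget for the harder comparisons.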
