Paper information - CEMAB: A Cross-Entropy-based Method for Large-Scale Multi-Armed Bandits

The multi-armed bandit (MAB) problem is an important model for studying the exploration-exploitation tradeoff in sequential decision making. In this problem, a gambler has to repeatedly choose between a number of slot machine arms to maximize the total payout, where the total number of plays is fixed. Although many methods have been proposed to solve the MAB problem, most have been designed for problems with a small number of arms. To ensure convergence to the optimal arm, many of these methods, including state-of-the-art methods such as UCB [2], require sweeping over the entire set of arms. As a result, such methods perform poorly in problems with a large number of arms. This paper proposes a new method for solving such large-scale MAB problems. The method, called Cross-Entropy-based Multi Armed Bandit (CEMAB), uses the Cross-Entropy method as a noisy optimizer to find the optimal arm with as little cost as possible. Experimental results indicate that CEMAB outperforms state-of-the-art methods for solving MABs with a large number of arms.
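To make the idea of using the Cross-Entropy method as a noisy optimizer over arms more concrete, the following is a minimal sketch of a generic Cross-Entropy loop on a categorical distribution over arms. It is not the CEMAB algorithm from the paper, whose details are not given in this abstract. The function names (`ce_bandit_sketch`, `make_bernoulli_arms`), the parameters (`batch`, `elite_frac`, `smoothing`), and the Bernoulli reward model are all assumptions made for illustration.

```python
import random


def make_bernoulli_arms(means):
    """Hypothetical test bed: arm i pays 1.0 with probability means[i], else 0.0."""
    def pull(arm, rng):
        return 1.0 if rng.random() < means[arm] else 0.0
    return pull


def ce_bandit_sketch(pull, n_arms, iterations=50, batch=200,
                     elite_frac=0.1, smoothing=0.7, seed=0):
    """Generic Cross-Entropy loop over a categorical distribution on arms.

    Assumed interface: pull(arm, rng) returns one noisy reward for that arm.
    Returns (most probable arm, final probabilities, total reward collected).
    """
    rng = random.Random(seed)
    arms = list(range(n_arms))
    probs = [1.0 / n_arms] * n_arms  # start from a uniform sampling distribution
    total_reward = 0.0
    n_elite = max(1, int(elite_frac * batch))

    for _ in range(iterations):
        # Sample a batch of arms from the current distribution; play each once.
        sampled = rng.choices(arms, weights=probs, k=batch)
        rewards = [pull(a, rng) for a in sampled]
        total_reward += sum(rewards)

        # Keep the plays with the highest observed (noisy) rewards as the elite set.
        order = sorted(range(batch), key=lambda i: rewards[i], reverse=True)
        elite = [sampled[i] for i in order[:n_elite]]

        # Move the distribution toward the elite arm frequencies (smoothed update).
        counts = [0] * n_arms
        for a in elite:
            counts[a] += 1
        probs = [(1.0 - smoothing) * p + smoothing * c / n_elite
                 for p, c in zip(probs, counts)]

    best = max(arms, key=lambda a: probs[a])
    return best, probs, total_reward
```

Because each arm is scored from single noisy plays, a loop like this can settle on a suboptimal arm. The sketch only shows why the approach avoids a full sweep over the arm set: every iteration plays `batch` sampled arms, regardless of how many arms exist.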

Hanna Kurniawati | Dirk P. Kroese | Erli Wang

[1] Jason L. Loeppky, et al. A Survey of Online Experiment Design with the Stochastic Multi-Armed Bandit, 2015, ArXiv.

[2] Shipra Agrawal, et al. Analysis of Thompson Sampling for the Multi-armed Bandit Problem, 2011, COLT.

[3] Michael L. Littman, et al. The Cross-Entropy Method Optimizes for Quantiles, 2013, ICML.

[4] H. Robbins. Some aspects of the sequential design of experiments, 1952.

[5] Dirk P. Kroese, et al. The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning, 2004.

[6] Peter Auer, et al. Finite-time Analysis of the Multiarmed Bandit Problem, 2002, Machine Learning.

[7] Peter Auer, et al. The Nonstochastic Multiarmed Bandit Problem, 2002, SIAM J. Comput.

[8] Rémi Munos, et al. Bandit Algorithms for Tree Search, 2007, UAI.

[9] Csaba Szepesvári, et al. X-armed Bandits, 2022.

[10] Chris Watkins, et al. Learning from delayed rewards, 1989.

[11] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[12] W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, 1933.

[13] David H. Ackley, et al. The effects of selection on noisy fitness optimization, 2011, GECCO '11.