Estimating the Maximum Expected Value through Upper Confidence Bound of Likelihood

Estimating the maximum expected value of a set of independent random variables is important in many domains, e.g., reinforcement learning. In this paper, we introduce a new estimator, the simplified weighted estimator, and discuss its performance and that of existing estimators, both theoretically and empirically, in a multi-armed bandit setting. Our estimator computes a weighted average of samples, where the weight of each sample is derived from the likelihood that the arm that yielded it is optimal among all arms. Our estimator is more computationally tractable than an existing weighted estimator and more accurate than other existing estimators. The bias of an estimator is defined as the difference between its estimates and the ground-truth value. Through theoretical analysis, we show that the bias of our estimator converges to zero as the sample size increases under a reasonable assumption, and that it remains reasonably small before convergence, whereas existing estimators suffer from large positive or negative biases. In our experiments, we empirically demonstrate the effectiveness of our estimator in a practical setting where samples are collected by popular strategies in reinforcement learning: UCB1, softmax, and epsilon-greedy. Estimator performance is evaluated by how quickly the bias, variance, and mean squared error decrease with the number of observed samples. Across various sets of random variables, our estimator is not always the best, but it performs well in most configurations.
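To make the general idea concrete, the following is a minimal Python sketch of a weighted estimator of the maximum expected value. As an assumption for illustration only, the weights here are Monte Carlo estimates of the probability that each arm's true mean is the largest, computed under Gaussian approximations of the sample-mean distributions; the simplified weighted estimator in this paper instead derives its weights from an upper confidence bound of the likelihood, and the function name `weighted_max_estimate` is a hypothetical helper, not part of the paper.

```python
import numpy as np

def weighted_max_estimate(sample_means, sample_vars, counts, n_draws=10000, rng=None):
    """Estimate max_i E[X_i] as a weighted average of per-arm sample means.

    Illustrative sketch: each weight approximates the probability that the
    corresponding arm has the largest true mean, obtained by Monte Carlo over
    Gaussian approximations of the sample means (an assumption made here for
    clarity, not the paper's UCB-of-likelihood weighting).
    """
    rng = np.random.default_rng() if rng is None else rng
    means = np.asarray(sample_means, dtype=float)
    stderr = np.sqrt(np.asarray(sample_vars, dtype=float) / np.asarray(counts, dtype=float))
    # Draw plausible mean vectors and count how often each arm comes out on top.
    draws = rng.normal(means, stderr, size=(n_draws, means.size))
    wins = np.bincount(draws.argmax(axis=1), minlength=means.size)
    weights = wins / n_draws  # weights sum to one
    # The estimate is the weighted average of the sample means.
    return float(weights @ means)

# Toy usage: three arms with noisy sample statistics.
est = weighted_max_estimate(sample_means=[0.40, 0.50, 0.45],
                            sample_vars=[0.25, 0.25, 0.25],
                            counts=[30, 30, 30])
print(est)
```

Compared with taking the maximum of the sample means (which is positively biased) or a single arm's sample mean (which can be negatively biased), a weighted average of this form spreads the estimate across arms in proportion to how likely each is to be optimal, which is the behavior the abstract attributes to the proposed estimator.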