Estimating the Maximum Expected Value through Upper Confidence Bound of Likelihood

Estimating the maximum expected value of a set of independent random variables is important in many domains, e.g., reinforcement learning. In this paper, we introduce a new estimator, the simplified weighted estimator, and discuss its performance and that of existing estimators, both theoretically and empirically, in a multi-armed bandit setting. Our estimator computes a weighted average of samples, where the weight of each sample is derived from the likelihood that the arm that yielded it is optimal among all arms. Our estimator is more computationally tractable than an existing weighted estimator and more accurate than other existing estimators. The bias of an estimator is defined as the difference between its estimates and the ground-truth value. Through theoretical analysis, we show that the bias of our estimator converges to zero as the sample size increases under a reasonable assumption, and that it remains reasonably small before convergence, whereas existing estimators suffer from large positive or negative biases. In our experiments, we empirically demonstrate the effectiveness of our estimator in a practical setting where samples are collected by popular strategies in reinforcement learning: UCB1, softmax, and epsilon-greedy. Estimator performance is evaluated by how quickly the bias, variance, and mean squared error decrease with the number of observed samples. Across various sets of random variables, our estimator is not always the best, but it performs well in most configurations.
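To make the general idea concrete, the following is a minimal Python sketch of a weighted estimator of the maximum expected value. As an assumption for illustration only, the weights here are Monte Carlo estimates of the probability that each arm's true mean is the largest, computed under Gaussian approximations of the sample-mean distributions; the simplified weighted estimator in this paper instead derives its weights from an upper confidence bound of the likelihood, and the function name `weighted_max_estimate` is a hypothetical helper, not part of the paper.

```python
import numpy as np

def weighted_max_estimate(sample_means, sample_vars, counts, n_draws=10000, rng=None):
    """Estimate max_i E[X_i] as a weighted average of per-arm sample means.

    Illustrative sketch: each weight approximates the probability that the
    corresponding arm has the largest true mean, obtained by Monte Carlo over
    Gaussian approximations of the sample means (an assumption made here for
    clarity, not the paper's UCB-of-likelihood weighting).
    """
    rng = np.random.default_rng() if rng is None else rng
    means = np.asarray(sample_means, dtype=float)
    stderr = np.sqrt(np.asarray(sample_vars, dtype=float) / np.asarray(counts, dtype=float))
    # Draw plausible mean vectors and count how often each arm comes out on top.
    draws = rng.normal(means, stderr, size=(n_draws, means.size))
    wins = np.bincount(draws.argmax(axis=1), minlength=means.size)
    weights = wins / n_draws  # weights sum to one
    # The estimate is the weighted average of the sample means.
    return float(weights @ means)

# Toy usage: three arms with noisy sample statistics.
est = weighted_max_estimate(sample_means=[0.40, 0.50, 0.45],
                            sample_vars=[0.25, 0.25, 0.25],
                            counts=[30, 30, 30])
print(est)
```

Compared with taking the maximum of the sample means (which is positively biased) or a single arm's sample mean (which can be negatively biased), a weighted average of this form spreads the estimate across arms in proportion to how likely each is to be optimal, which is the behavior the abstract attributes to the proposed estimator.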