Paper Information - Estimating Maximum Expected Value through Gaussian Approximation

Estimating Maximum Expected Value through Gaussian Approximation

This paper addresses the estimation of the maximum expected value of a set of independent random variables. The performance of several learning algorithms (e.g., Q-learning) is affected by the accuracy of such estimation. Unfortunately, no unbiased estimator exists. The usual approach of taking the maximum of the sample means leads to large overestimates that may significantly harm the performance of the learning algorithm. Recent works have shown that the cross-validation estimator, which is negatively biased, outperforms the maximum estimator in many sequential decision-making scenarios. However, the relative performance of the two estimators is highly problem-dependent. In this paper, we propose a new estimator for the maximum expected value, based on a weighted average of the sample means, where the weights are computed using Gaussian approximations for the distributions of the sample means. We compare the proposed estimator with the other state-of-the-art methods both theoretically, by deriving upper bounds on the bias and the variance of the estimator, and empirically, by testing its performance on different sequential learning problems.
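
The three estimators mentioned in the abstract can be compared in a small sketch. The abstract does not give the weight formula, so the code below rests on an assumption: each weight is taken to be the probability that the corresponding variable has the largest mean when every sample mean is modelled as a Gaussian centred on the sample mean, with standard deviation equal to the standard error. The weights are approximated by Monte Carlo sampling for simplicity. All function names and parameters (`max_estimator`, `double_estimator`, `weighted_estimator`, `draws`, `seed`) are hypothetical and are not taken from the paper.

```python
import math
import random
from statistics import fmean, stdev


def max_estimator(samples):
    """Maximum of the sample means (tends to overestimate)."""
    return max(fmean(s) for s in samples)


def double_estimator(samples):
    """Two-fold cross-validation estimator (tends to underestimate).

    One half of each sample selects the best variable, the other half
    evaluates it. Each sample needs at least two observations.
    """
    first = [s[: len(s) // 2] for s in samples]
    second = [s[len(s) // 2:] for s in samples]
    means_a = [fmean(s) for s in first]
    best = means_a.index(max(means_a))
    return fmean(second[best])


def weighted_estimator(samples, draws=20000, seed=0):
    """Weighted average of the sample means.

    Assumption (not stated in the abstract): weight i approximates the
    probability that variable i has the largest mean when each sample
    mean is modelled as N(mean_i, stderr_i ** 2).
    """
    rng = random.Random(seed)
    means = [fmean(s) for s in samples]
    errs = [stdev(s) / math.sqrt(len(s)) for s in samples]
    wins = [0] * len(samples)
    for _ in range(draws):
        # Draw one plausible value per mean and record which is largest.
        sim = [rng.gauss(m, e) for m, e in zip(means, errs)]
        wins[sim.index(max(sim))] += 1
    weights = [w / draws for w in wins]
    estimate = sum(w * m for w, m in zip(weights, means))
    return estimate, weights


if __name__ == "__main__":
    rng = random.Random(1)
    # Ten variables that all have true mean 0, so the true maximum is 0.
    data = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(10)]
    print("maximum  :", max_estimator(data))
    print("double   :", double_estimator(data))
    print("weighted :", weighted_estimator(data)[0])
```

Because the weights are non-negative and sum to one, the weighted estimate can never exceed the maximum of the sample means. When one variable clearly dominates, its weight approaches one and the two estimates coincide.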
