Optimism in the face of uncertainty supported by a statistically-designed multi-armed bandit algorithm

The principle of optimism in the face of uncertainty is a well-known heuristic in sequential decision-making problems. The Overtaking method, which is based on this principle, is an effective algorithm for solving multi-armed bandit problems; in previous work, however, it was defined by a collection of heuristic formulations. The objective of the present paper is to redefine the value functions of the Overtaking method and to unify their formulation. The unified Overtaking method is associated with the upper bounds of statistical confidence intervals for the expected rewards, and this unification enhances the universality of the method. As a consequence, we obtain a new Overtaking method for exponentially distributed rewards, analyze it numerically, and show that it outperforms the UCB algorithm on average. The present study suggests that, in the context of multi-armed bandit problems, the principle of optimism in the face of uncertainty should be regarded not as a heuristic but as a statistical consequence of the law of large numbers applied to the sample mean of rewards, combined with the estimation of upper bounds on the expected rewards.
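
To make the statistical reading of optimism concrete, the sketch below scores each arm by the upper end of an exact confidence interval for the mean of exponentially distributed rewards, with the UCB1 index of Auer et al. [5] for comparison. This is an illustrative construction, not the paper's actual Overtaking value function; the function names, confidence level alpha, arm means, and horizon are assumptions made for the example.

```python
import numpy as np
from scipy import stats

def ucb1_index(total, n, t):
    """UCB1 index of Auer et al. (2002): sample mean plus exploration bonus."""
    return total / n + np.sqrt(2.0 * np.log(t) / n)

def exp_ci_index(total, n, t, alpha=0.05):
    """Upper end of the exact one-sided (1 - alpha) confidence interval for
    the mean of i.i.d. exponential rewards: since 2 * total / mean follows a
    chi-squared distribution with 2n degrees of freedom, the mean is below
    2 * total / chi2_quantile(alpha, 2n) with probability 1 - alpha."""
    return 2.0 * total / stats.chi2.ppf(alpha, 2 * n)

def run_bandit(means, horizon, index_fn, seed=0):
    """Index policy: pull each arm once, then always pull the arm whose
    confidence-interval-based index is largest."""
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.zeros(k, dtype=int)
    totals = np.zeros(k)
    cumulative = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                      # initialisation: one pull per arm
        else:
            scores = [index_fn(totals[i], counts[i], t) for i in range(k)]
            arm = int(np.argmax(scores))
        r = rng.exponential(means[arm])      # exponentially distributed reward
        counts[arm] += 1
        totals[arm] += r
        cumulative += r
    return cumulative

# Average cumulative reward over independent runs for each index.
means = [1.0, 1.2, 0.8]
for name, fn in [("UCB1", ucb1_index), ("exact-CI upper bound", exp_ci_index)]:
    avg = np.mean([run_bandit(means, 5000, fn, seed=s) for s in range(20)])
    print(f"{name}: {avg:.1f}")
```

The exact interval used here rests on the fact that the sum of n i.i.d. exponential rewards with mean mu satisfies 2 * total / mu ~ chi-squared(2n), which is what ties the optimistic index directly to classical interval estimation rather than to a heuristic bonus term.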

[1] M. Kamiura et al., "Overtaking method based on sand-sifter mechanism: Why do optimistic value functions find optimal solutions in multi-armed bandit problems?", Biosystems, 2015.

[2] M. Kamiura et al., "Overtaking Method based on Variance of Values: Resolving the Exploration–Exploitation Dilemma", 2013.

[3] J. A. Shepperd et al., "Exploring the causes of comparative optimism", 2002.

[4] C. Szepesvári et al., "Exploration-exploitation tradeoff using variance estimates in multi-armed bandits", Theoretical Computer Science, 2009.

[5] P. Auer et al., "Finite-time Analysis of the Multiarmed Bandit Problem", Machine Learning, 2002.

[6] A. Agresti et al., "Approximate is Better than “Exact” for Interval Estimation of Binomial Proportions", The American Statistician, 1998.

[7] R. S. Sutton et al., "Reinforcement Learning: An Introduction", MIT Press, 1998.

[8] T. D. Ross et al., "Accurate confidence intervals for binomial proportion and Poisson rate estimation", Computers in Biology and Medicine, 2003.

[9] N. R. Sturtevant et al., "An Analysis of UCT in Multi-Player Games", ICGA Journal, 2008.

[10] E. B. Wilson, "Probable Inference, the Law of Succession, and Statistical Inference", Journal of the American Statistical Association, 1927.

[11] R. Munos et al., "Pure exploration in finitely-armed and continuous-armed bandits", Theoretical Computer Science, 2011.

[12] H. Robbins et al., "Asymptotically efficient adaptive allocation rules", Advances in Applied Mathematics, 1985.

[13] H. Robbins, "Some aspects of the sequential design of experiments", Bulletin of the American Mathematical Society, 1952.

[14] R. Newcombe, "Two-sided confidence intervals for the single proportion: comparison of seven methods", Statistics in Medicine, 1998.

[15] S. Gelly et al., "Exploration exploitation in Go: UCT for Monte-Carlo Go", NIPS, 2006.

[16] J. Reiczigel et al., "Confidence intervals for the binomial parameter: some new considerations", Statistics in Medicine, 2003.

[17] S. Wallis, "Binomial Confidence Intervals and Contingency Tests: Mathematical Fundamentals and the Evaluation of Alternative Methods", Journal of Quantitative Linguistics, 2013.

[18] S. Bubeck et al., "Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems", Foundations and Trends in Machine Learning, 2012.