Planning in Reward-Rich Domains via PAC Bandits

In some decision-making environments, successful solutions are common. When the evaluation of candidate solutions is noisy, however, the challenge is knowing when a “good enough” answer has been found. We formalize this problem as an infinite-armed bandit and provide upper and lower bounds on the number of evaluations, or “pulls”, needed to identify a solution whose evaluation exceeds a given threshold r0. We present several algorithms and use them to identify reliable strategies for solving screens from the video games Infinite Mario and Pitfall! We demonstrate order-of-magnitude improvements in sample complexity over a natural approach that pulls each arm until its success probability has been estimated accurately.
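To make the setup concrete, here is a minimal Python sketch of the threshold-identification problem and one simple early-abandonment strategy. This is an illustration under assumed conditions, not the paper's algorithm: arms are drawn as Bernoulli success probabilities from an arbitrary Beta(1, 3) prior standing in for a reward-rich domain, and the stopping rule is a standard anytime Hoeffding confidence bound; the function names, parameters, and pull cap are all hypothetical.

```python
import math
import random

def sample_arm():
    """Draw a fresh arm from the (infinite) arm distribution.
    Each arm is a Bernoulli success probability; the Beta(1, 3)
    prior is an arbitrary stand-in for a reward-rich domain in
    which good arms are reasonably common (assumption)."""
    return random.betavariate(1, 3)

def pull(p):
    """Noisy evaluation of an arm: 1 with probability p, else 0."""
    return 1 if random.random() < p else 0

def find_good_arm(r0, epsilon, delta, max_pulls_per_arm=10_000):
    """Search for an arm whose success probability exceeds r0 - epsilon.

    Sketch strategy: draw arms one at a time and pull each until an
    anytime Hoeffding interval around its empirical mean lies entirely
    above r0 - epsilon (accept) or entirely below r0 (abandon early and
    draw a new arm), rather than estimating every arm to full precision.
    """
    total_pulls = 0
    while True:
        p = sample_arm()
        successes = 0
        for n in range(1, max_pulls_per_arm + 1):
            successes += pull(p)
            total_pulls += 1
            mean = successes / n
            # Hoeffding radius, union-bounded over n so the interval
            # holds simultaneously for all pull counts with prob. 1 - delta.
            radius = math.sqrt(math.log(2 * n * (n + 1) / delta) / (2 * n))
            if mean - radius >= r0 - epsilon:
                return p, total_pulls   # confidently good enough
            if mean + radius < r0:
                break                   # confidently below threshold

if __name__ == "__main__":
    arm, pulls = find_good_arm(r0=0.5, epsilon=0.05, delta=0.1)
    print(f"accepted arm with true p = {arm:.3f} after {pulls} pulls")
```

Because a clearly bad arm is abandoned as soon as its upper confidence bound falls below r0, the total pull count in a reward-rich domain is typically far smaller than under the naive approach of estimating each arm's success probability to accuracy epsilon, which mirrors the comparison the abstract describes.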
