The Max K-Armed Bandit: A New Model of Exploration Applied to Search Heuristic Selection

The multiarmed bandit is often used as an analogy for the tradeoff between exploration and exploitation in search problems. The classic problem involves allocating trials to the arms of a multiarmed slot machine to maximize the expected sum of rewards. We pose a new variation of the multiarmed bandit, the Max K-Armed Bandit, in which trials must be allocated among the arms to maximize the expected best single-sample reward over the series of trials. The motivation for the Max K-Armed Bandit is the allocation of restarts among a set of multistart stochastic search algorithms. We present an analysis of the Max K-Armed Bandit showing that, under certain assumptions, the optimal strategy allocates trials to the observed best arm at a rate that increases double exponentially relative to the other arms. This motivates an exploration strategy that follows a Boltzmann distribution with an exponentially decaying temperature parameter. We compare this exploration policy to policies that allocate trials to the observed best arm at rates faster (and slower) than double exponentially. The results confirm, for two scheduling domains, that the double exponential increase in the rate of allocations to the observed best heuristic outperforms the other approaches.
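
The Boltzmann strategy described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes each arm is scored by its best observed reward so far, and the names `boltzmann_probs`, `max_k_armed_bandit`, `t0`, and `decay` are hypothetical. With temperature T(t) = t0 * exp(-decay * t), the selection-probability ratio between two arms whose scores differ by d is exp(d / T(t)) = exp((d / t0) * exp(decay * t)), which grows double exponentially in t.

```python
import math
import random


def boltzmann_probs(scores, temperature):
    """Boltzmann (softmax) selection probabilities for the given arm scores."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)  # subtract the max before exponentiating to avoid overflow
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def max_k_armed_bandit(arms, n_trials, t0=1.0, decay=0.01, rng=None):
    """Allocate n_trials among arms, aiming for the best single sample.

    arms: list of callables taking a random.Random and returning a reward.
    Assumption: each arm's score is its best observed reward so far.
    Returns (best reward seen overall, number of pulls per arm).
    """
    rng = rng or random.Random(0)
    best = [arm(rng) for arm in arms]  # one initial sample per arm
    counts = [1] * len(arms)
    for t in range(n_trials - len(arms)):
        # Exponentially decaying temperature, floored to avoid division by zero.
        temperature = max(t0 * math.exp(-decay * t), 1e-12)
        probs = boltzmann_probs(best, temperature)
        i = rng.choices(range(len(arms)), weights=probs)[0]
        best[i] = max(best[i], arms[i](rng))
        counts[i] += 1
    return max(best), counts


if __name__ == "__main__":
    # Three hypothetical heuristics modeled as Gaussian reward distributions.
    arms = [
        lambda r: r.gauss(0.0, 1.0),
        lambda r: r.gauss(0.5, 1.0),
        lambda r: r.gauss(1.0, 2.0),
    ]
    best_reward, pulls = max_k_armed_bandit(arms, n_trials=200)
    print(best_reward, pulls)
```

In the restart-allocation setting, each arm would correspond to one run of a stochastic search heuristic and the reward to the quality of the solution that run returns. The values of `t0` and `decay` here are placeholders that would need tuning per domain.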
