Optimistic Planning for the Stochastic Knapsack Problem

The stochastic knapsack problem is a stochastic resource allocation problem that arises frequently and yet is exceptionally hard to solve. We derive and study an optimistic planning algorithm specifically designed for the stochastic knapsack problem. Unlike other optimistic planning algorithms for MDPs, our algorithm, OpStoK, avoids the use of discounting and is adaptive to the amount of resources available. We achieve this behavior by means of a concentration inequality that applies simultaneously to the capacity and reward estimates. Crucially, an application of Doob's inequality guarantees that these confidence regions hold collectively over all time steps. We demonstrate that the method returns an ε-optimal solution to the stochastic knapsack problem with high probability. To the best of our knowledge, our algorithm is the first to provide such guarantees for the stochastic knapsack problem. Furthermore, OpStoK is an anytime algorithm and returns a good solution even if stopped prematurely, which is particularly important given the difficulty of the problem. We also provide theoretical conditions under which OpStoK does not expand all policies, and we demonstrate favorable performance in a simple experimental setting.
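To make the core idea concrete, the following is a minimal, hypothetical sketch of the kind of optimistic selection rule the abstract describes: each candidate policy maintains simultaneous confidence bounds on its mean reward and its mean capacity consumption, and the planner optimistically picks the policy with the highest reward upper bound among those whose capacity lower bound fits the budget. This is not the paper's OpStoK algorithm; all names and parameters (hoeffding_radius, optimistic_choice, delta) are illustrative assumptions, and a simple Hoeffding bound stands in for the paper's concentration inequality.

```python
# Illustrative sketch only (not the paper's OpStoK): optimistic selection
# among candidate policies using simultaneous Hoeffding-style confidence
# bounds on both mean reward and mean capacity consumption per policy.
import math
import random


def hoeffding_radius(n, delta, value_range=1.0):
    """Half-width of a (1 - delta) confidence interval after n i.i.d.
    samples of a quantity bounded in an interval of length value_range."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def optimistic_choice(samples, budget, delta=0.05):
    """samples: dict mapping policy -> list of (reward, capacity) pairs.
    Returns the policy with the highest upper confidence bound on reward,
    among policies whose lower confidence bound on capacity fits the budget."""
    best, best_ucb = None, -float("inf")
    for policy, obs in samples.items():
        n = len(obs)
        if n == 0:
            return policy  # unsampled policies are maximally optimistic
        mean_r = sum(r for r, _ in obs) / n
        mean_c = sum(c for _, c in obs) / n
        rad = hoeffding_radius(n, delta)
        # Optimism in the face of uncertainty: upper bound on reward, but
        # only for policies that plausibly respect the capacity budget
        # (i.e. whose capacity lower bound is within budget).
        if mean_c - rad <= budget and mean_r + rad > best_ucb:
            best, best_ucb = policy, mean_r + rad
    return best


# Toy usage: two hypothetical policies with noisy rewards and capacities,
# all values scaled into [0, 1].
if __name__ == "__main__":
    random.seed(0)
    samples = {
        "greedy": [(random.uniform(0.4, 0.6), random.uniform(0.5, 0.7))
                   for _ in range(50)],
        "cautious": [(random.uniform(0.2, 0.4), random.uniform(0.1, 0.3))
                     for _ in range(50)],
    }
    print(optimistic_choice(samples, budget=0.5))
```

In this toy setting the rule favors the higher-reward policy as long as its capacity estimate plausibly fits the budget; the paper's contribution is to make such reward and capacity confidence regions hold jointly over all time steps, which the sketch above does not attempt.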
