Planning and Learning with Stochastic Action Sets

In many practical applications of reinforcement learning (RL), the set of actions available at a given state is a random variable whose realizations are governed by an exogenous stochastic process. Somewhat surprisingly, the foundations of such sequential decision processes have remained unaddressed. In this work, we formalize and investigate MDPs with stochastic action sets (SAS-MDPs) to provide these foundations. We show that optimal policies and value functions in this model have a structure that admits a compact representation. From an RL perspective, we show that Q-learning with sampled action sets is sound. In the model-based setting, we consider two important special cases: one in which individual actions are available with independent probabilities, and a sampling-based model for unknown distributions. We develop polynomial-time value and policy iteration methods for both cases, and a polynomial-time linear programming solution for the first.
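To make the sampled-action-set idea concrete, below is a minimal illustrative sketch of tabular Q-learning in which both action selection and the bootstrapped target are restricted to the realized action set at each step. The toy environment, the independent availability probabilities (`avail_prob`), and all hyperparameters are assumptions for illustration only, not the paper's experimental setup.

```python
import numpy as np

# Sketch: Q-learning with stochastic action sets.
# At each step the environment reveals which actions are currently available;
# the agent acts epsilon-greedily over that set, and the bootstrapped target
# maximizes over the *sampled* next-step action set.

rng = np.random.default_rng(0)

n_states, n_actions = 5, 3
gamma, alpha, epsilon = 0.95, 0.1, 0.1
avail_prob = np.array([0.9, 0.6, 0.4])                              # independent availability probabilities (assumed)
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))               # toy reward table
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))    # toy transition kernel

def sample_action_set():
    """Sample the set of available actions; resample if it comes up empty."""
    while True:
        avail = np.flatnonzero(rng.random(n_actions) < avail_prob)
        if avail.size > 0:
            return avail

Q = np.zeros((n_states, n_actions))
s, avail = 0, sample_action_set()

for _ in range(50_000):
    # epsilon-greedy restricted to the currently available actions
    if rng.random() < epsilon:
        a = rng.choice(avail)
    else:
        a = avail[np.argmax(Q[s, avail])]

    r = R[s, a]
    s_next = rng.choice(n_states, p=P[s, a])
    avail_next = sample_action_set()

    # target uses the max over the sampled next action set
    target = r + gamma * Q[s_next, avail_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

    s, avail = s_next, avail_next

print(np.round(Q, 2))
```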
