Vulcan: A Monte Carlo Algorithm for Large Chance Constrained MDPs with Risk Bounding Functions

Chance Constrained Markov Decision Processes (CCMDPs) maximize reward subject to a bounded probability of failure, and have frequently been applied to planning with potentially dangerous outcomes or unknown environments. Existing solution algorithms have required strong heuristics or have been limited to relatively small problems with up to millions of states, because the optimal action to take from a given state depends on the probability of failure in the rest of the policy, leading to a coupled problem that is difficult to solve. In this paper we examine a generalization of a CCMDP that trades off probability of failure against reward through a functional relationship. We derive a constraint that can be applied to each state history in a policy individually and that guarantees the chance constraint will be satisfied. The approach decouples states in the CCMDP, so that large problems can be solved efficiently. We then introduce Vulcan, which uses our constraint to apply Monte Carlo Tree Search (MCTS) to CCMDPs. Vulcan can be applied to problems where it is infeasible to generate the entire state space and policies must be returned in an anytime manner. We show that Vulcan and its variants run tens to hundreds of times faster than linear programming methods, and over ten times faster than heuristic-based methods, all without the need for a heuristic, while returning solutions with a mean suboptimality on the order of a few percent. Finally, we use Vulcan to solve for a chance constrained policy in a CCMDP with over $10^{13}$ states in 3 minutes.
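To make the interaction between MCTS and a per-history risk constraint concrete, the sketch below shows a UCT-style selection step that prunes child nodes whose accumulated failure probability exceeds a per-history bound. This is an illustrative approximation under assumed names, not the algorithm from the paper: the identifiers (Node, uct_select, risk_bound, failure_prob) are hypothetical, and the simple threshold check stands in for the constraint actually derived in the paper.

```python
import math

# Hedged sketch: UCT selection restricted to risk-feasible children.
# All names below are illustrative assumptions, not the paper's API.

class Node:
    def __init__(self, state, failure_prob=0.0, parent=None):
        self.state = state
        self.failure_prob = failure_prob  # failure probability accumulated along this history
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

def uct_select(node, risk_bound, c=1.4):
    """Return the child maximizing the UCB1 score, considering only children
    whose history failure probability stays within the per-history risk bound."""
    feasible = [ch for ch in node.children if ch.failure_prob <= risk_bound]
    if not feasible:
        return None  # every continuation violates the bound; caller must backtrack

    def ucb(ch):
        if ch.visits == 0:
            return float("inf")  # force at least one visit per feasible child
        exploit = ch.total_reward / ch.visits
        explore = c * math.sqrt(math.log(node.visits) / ch.visits)
        return exploit + explore

    return max(feasible, key=ucb)
```

Because the bound is checked independently on each history, the selection step never needs to reason about the failure probability of the rest of the policy, which is the decoupling property the abstract describes.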
