Robust planning in domains with stochastic outcomes, adversaries, and partial observability

Real-world planning problems often involve multiple sources of uncertainty, including randomness in action outcomes, the presence of adversarial agents, and incomplete knowledge of the world state. This thesis describes algorithms for four related formal models that address these types of uncertainty: Markov decision processes (MDPs), MDPs with adversarial costs, extensive-form games, and a new class of games that includes both extensive-form games and MDPs as special cases. Markov decision processes represent problems in which actions have stochastic outcomes. We describe several new algorithms for MDPs, and then show how MDPs can be generalized to model an adversary that has some control over costs. Extensive-form games model games with random events and partial observability. In the zero-sum perfect-recall case, a minimax solution can be found in time polynomial in the size of the game tree. However, the game tree must "remember" all past actions and random outcomes, so its size grows exponentially in the length of the game. This thesis introduces a generalization of extensive-form games that relaxes the need to remember all past actions exactly, yielding exponentially smaller representations for interesting problems; this formulation also unifies extensive-form games with MDP planning. We present a new class of fast anytime algorithms for the offline computation of minimax equilibria in both traditional and generalized extensive-form games, and demonstrate their effectiveness experimentally on an adversarial MDP problem and on a large abstracted poker game. We also present a new algorithm for playing repeated extensive-form games that can be used when only the total payoff of the game is observed on each round.
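To make the stochastic-outcome setting concrete, the sketch below shows value iteration for a finite cost-minimizing MDP, the basic model the abstract builds on. It is only an illustrative sketch, not one of the thesis's new algorithms; the arrays `P` (transition probabilities), `C` (expected costs), the discount `gamma`, and the tiny example instance are all hypothetical.

```python
import numpy as np

def value_iteration(P, C, gamma=0.95, tol=1e-8):
    """Minimal value-iteration sketch for a finite, cost-minimizing MDP.

    P : array of shape (S, A, S), P[s, a, s'] = transition probability
    C : array of shape (S, A), expected immediate cost of action a in state s
    Returns the optimal value function and a greedy policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Bellman backup: Q[s, a] = C[s, a] + gamma * sum_{s'} P[s, a, s'] V[s']
        Q = C + gamma * (P @ V)
        V_new = Q.min(axis=1)  # the planner picks the cheapest action
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmin(axis=1)

# Tiny two-state, two-action instance with made-up numbers.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
C = np.array([[1.0, 4.0],
              [2.0, 0.5]])
V, policy = value_iteration(P, C)
print(V, policy)
```

The adversarial-cost and extensive-form models discussed above replace the fixed cost array `C` with costs chosen by an opponent, which is what motivates the game-theoretic algorithms the thesis develops.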
