Compact parametric models for efficient sequential decision making in high-dimensional, uncertain domains

Within artificial intelligence and robotics there is considerable interest in how a single agent can autonomously make sequential decisions in large, high-dimensional, uncertain domains. This thesis presents decision-making algorithms for maximizing the expected sum of future rewards in two types of large, high-dimensional, uncertain situations: when the agent knows its current state but lacks a model of the world dynamics, within a Markov decision process (MDP) framework; and in partially observable Markov decision processes (POMDPs), when the agent knows the dynamics and reward models but receives information about its state only through potentially noisy sensors.

One of the key challenges in sequential decision making is the tradeoff between optimality and tractability. To handle high-dimensional (many variables), large (many potential values per variable) domains, an algorithm must have a computational complexity that scales gracefully with the number of dimensions. However, many prior approaches achieve such scalability through heuristic methods that offer limited or no guarantees on how close to optimal the resulting decisions are, or under what circumstances. Algorithms that do provide rigorous optimality bounds often do so at the expense of tractability.

This thesis proposes that parametric models of the world dynamics, rewards, and observations can enable efficient, provably close-to-optimal decision making in large, high-dimensional, uncertain environments. In support of this, we present a reinforcement learning (RL) algorithm in which a parametric model allows the algorithm to make close-to-optimal decisions on all but a number of samples that scales polynomially with the dimension, a significant improvement over most prior provably approximately optimal RL algorithms. We also show that parametric models can reduce the computational complexity of forward-search POMDP planning from an exponential to a polynomial dependence on the state dimension. Under mild conditions, our new forward-search POMDP planner maintains prior optimality guarantees on the resulting decisions. We present experimental results on an RL task involving robot navigation over varying terrain and on a large simulated global-driving POMDP planning problem.

Thesis Supervisor: Nicholas Roy
Title: Assistant Professor
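For concreteness, the objective described above can be written in the standard discounted form below; the discount factor gamma and the infinite horizon are assumptions for illustration, since the abstract does not fix them. In an MDP the agent seeks a policy pi maximizing the expected sum of future rewards from its current state, while in a POMDP it plans over beliefs b (probability distributions over states) updated by Bayes' rule:

    V^{\pi}(s) = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^{t} \, r(s_t, \pi(s_t)) \,\Big|\, s_0 = s \Big]

    b'(s') = \eta \; p(z \mid s', a) \sum_{s} p(s' \mid s, a) \, b(s)

where eta is a normalizing constant, a is the action taken, and z is the observation received. The belief update is what becomes expensive in high-dimensional state spaces, which is where the parametric representations proposed in the thesis enter.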
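The forward-search POMDP planning referred to above can be illustrated with the generic depth-limited search sketch below. This is a minimal sketch of belief-space expectimax, not the thesis's specific parametric planner; the helpers update_belief, expected_reward, and observation_probs are hypothetical placeholders that a concrete transition, reward, and observation model would supply.

    # Minimal sketch of depth-limited forward search over POMDP beliefs.
    # Generic expectimax planning, NOT the thesis's parametric planner;
    # `update_belief`, `expected_reward`, and `observation_probs` are
    # hypothetical placeholders the caller must provide.

    def forward_search(belief, depth, actions, observations,
                       update_belief, expected_reward, observation_probs,
                       gamma=0.95):
        """Return (best_value, best_action) for the given belief."""
        if depth == 0:
            return 0.0, None
        best_value, best_action = float("-inf"), None
        for a in actions:
            # Immediate expected reward of taking action a in this belief.
            value = expected_reward(belief, a)
            for z in observations:
                p_z = observation_probs(belief, a, z)
                if p_z <= 0.0:
                    continue  # skip unreachable observation branches
                next_belief = update_belief(belief, a, z)
                future, _ = forward_search(next_belief, depth - 1, actions,
                                           observations, update_belief,
                                           expected_reward, observation_probs,
                                           gamma)
                value += gamma * p_z * future
            if value > best_value:
                best_value, best_action = value, a
        return best_value, best_action

In this style of planner the dominant per-node cost is representing and updating the belief over a high-dimensional state; the abstract's claim is that parametric belief models reduce that cost from exponential to polynomial in the state dimension while, under mild conditions, preserving existing optimality guarantees.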
