Finding Approximate POMDP Solutions Through Belief Compression

Recent research in robotics has demonstrated the utility of probabilistic models for perception and state tracking on deployed robot systems. For example, Kalman filters and Markov localisation have been used successfully in many robot applications (Leonard & Durrant-Whyte, 1991; Thrun et al., 2000). There has also been considerable research into control and decision-making algorithms that are robust to specific kinds of uncertainty (Bagnell & Schneider, 2001). Few control algorithms, however, make use of full probabilistic representations throughout planning. As a consequence, robot control can become increasingly brittle as the system's perceptual and state uncertainty grows. This thesis addresses the problem of decision making under uncertainty. In particular, we use a planning model called the partially observable Markov decision process, or POMDP (Sondik, 1971). The POMDP model computes a policy that maximises the expected future reward based on the complete probabilistic state estimate, or belief. Unfortunately, finding an optimal policy exactly is computationally demanding and thus infeasible for most problems that represent real-world scenarios. This thesis describes a scalable approach to POMDP planning that uses low-dimensional representations of the belief space. We show how a variant of Principal Components Analysis (PCA) called Exponential family PCA (Collins et al., 2002) can be used to compress certain kinds of large real-world POMDPs and to find policies for these problems. By finding low-dimensional representations of POMDPs, we are able to find good policies for problems that are orders of magnitude larger than those solvable by conventional approaches.
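
The central idea can be illustrated with a small numerical sketch. The Python code below is an illustrative sketch, not the thesis implementation: it compresses a set of sampled belief vectors with an exponential-family PCA factorisation using the exponential link and an unnormalised KL-style loss, fit here by plain gradient descent rather than the more sophisticated updates used in the thesis. The function names (epca_compress, reconstruct), learning rate, and problem sizes are assumptions made for the example.

import numpy as np

# Illustrative sketch of exponential-family PCA (E-PCA) belief compression.
# Assumptions for the example: the exponential link f(x) = exp(x), an
# unnormalised KL-style loss, and plain gradient descent. Sizes, learning
# rate, and iteration count are arbitrary.

def epca_compress(B, k, lr=0.005, iters=5000, seed=0):
    """Factor an (n_states x n_beliefs) matrix of sampled beliefs B into a
    basis U (n_states x k) and low-dimensional coordinates V (k x n_beliefs)
    such that exp(U @ V) approximately reconstructs B."""
    rng = np.random.default_rng(seed)
    n_states, n_beliefs = B.shape
    U = 0.1 * rng.standard_normal((n_states, k))
    V = 0.1 * rng.standard_normal((k, n_beliefs))
    for _ in range(iters):
        R = np.exp(U @ V) - B                      # gradient of the loss w.r.t. (U @ V)
        U, V = U - lr * (R @ V.T), V - lr * (U.T @ R)
    return U, V

def reconstruct(U, V):
    """Map low-dimensional coordinates back to normalised beliefs."""
    B_hat = np.exp(U @ V)
    return B_hat / B_hat.sum(axis=0, keepdims=True)

# Usage: compress 200 sampled beliefs over 50 states down to 3 dimensions.
rng = np.random.default_rng(1)
B = rng.dirichlet(np.ones(50), size=200).T         # each column is a belief
U, V = epca_compress(B, k=3)
print("max reconstruction error:", np.abs(reconstruct(U, V) - B).max())

In the thesis, the low-dimensional coordinates recovered in this way become the state space for planning, so a policy is computed over a handful of dimensions rather than over the full belief simplex.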

[1]  A. Householder,et al.  Discussion of a set of points in terms of their mutual distances , 1938 .

[2]  Edsger W. Dijkstra,et al.  A note on two problems in connexion with graphs , 1959, Numerische Mathematik.

[3]  Ronald A. Howard,et al.  Dynamic Programming and Markov Processes , 1960 .

[4]  R. Bellman  Dynamic programming , 1957, Science.

[5]  L. Bregman The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming , 1967 .

[6]  E. J. Sondik,et al.  The Optimal Control of Partially Observable Markov Decision Processes. , 1971 .

[7]  Edward J. Sondik,et al.  The Optimal Control of Partially Observable Markov Processes over a Finite Horizon , 1973, Oper. Res..

[8]  H. Akaike A new look at the statistical model identification , 1974 .

[9]  Toru Maruyama  Some developments in convex analysis , 1977 .

[10]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM algorithm (with discussion) , 1977 .

[11]  Dimitri P. Bertsekas,et al.  Dynamic Programming and Stochastic Control , 1977, IEEE Transactions on Systems, Man, and Cybernetics.

[12]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[13]  Lawrence R. Rabiner,et al.  A tutorial on Hidden Markov Models , 1986 .

[14]  Pravin Varaiya,et al.  Stochastic Systems: Estimation, Identification, and Adaptive Control , 1986 .

[15]  Marcel Schoppers,et al.  Universal Plans for Reactive Robots in Unpredictable Environments , 1987, IJCAI.

[16]  P. McCullagh,et al.  Generalized Linear Models , 1992 .

[17]  Hsien-Te Cheng,et al.  Algorithms for partially observable markov decision processes , 1989 .

[18]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[19]  Ingemar J. Cox,et al.  Autonomous Robot Vehicles , 1990, Springer New York.

[20]  Sheryl R. Young,et al.  Use of dialogue, pragmatics and semantics to enhance speech recognition , 1990, Speech Commun..

[21]  Leslie Pack Kaelbling,et al.  Input Generalization in Delayed Reinforcement Learning: An Algorithm and Performance Comparisons , 1991, IJCAI.

[22]  Jean-Claude Latombe,et al.  Robot motion planning , 1991, The Kluwer international series in engineering and computer science.

[23]  William S. Lovejoy,et al.  Computationally Feasible Bounds for Partially Observed Markov Decision Processes , 1991, Oper. Res..

[24]  Hugh F. Durrant-Whyte,et al.  Mobile robot localization by tracking geometric beacons , 1991, IEEE Trans. Robotics Autom..

[25]  D. Moore Simplicial Mesh Generation with Applications , 1992 .

[26]  John A. Nelder,et al.  Generalized linear models. 2nd ed. , 1993 .

[27]  Jean-Claude Latombe,et al.  Planning the Motions of a Mobile Robot in a Sensory Uncertainty Field , 1994, IEEE Trans. Pattern Anal. Mach. Intell..

[28]  Gregory Dudek,et al.  Precise positioning using model-based maps , 1994, Proceedings of the 1994 IEEE International Conference on Robotics and Automation.

[29]  Leslie Pack Kaelbling,et al.  Acting Optimally in Partially Observable Stochastic Domains , 1994, AAAI.

[30]  Mark C. Torrance,et al.  Natural communication with robots , 1994 .

[31]  R. Simmons,et al.  Probabilistic Navigation in Partially Observable Environments , 1995 .

[32]  Geoffrey J. Gordon Stable Function Approximation in Dynamic Programming , 1995, ICML.

[33]  Stuart J. Russell,et al.  Stochastic simulation algorithms for dynamic probabilistic networks , 1995, UAI.

[34]  Teuvo Kohonen,et al.  Self-Organizing Maps , 1995 .

[35]  Peter Norvig,et al.  Artificial Intelligence: A Modern Approach , 1995 .

[36]  Andrew McCallum,et al.  Instance-Based Utile Distinctions for Reinforcement Learning , 1995 .

[37]  Reid G. Simmons,et al.  Probabilistic Robot Navigation in Partially Observable Environments , 1995, IJCAI.

[38]  Stuart J. Russell,et al.  Approximating Optimal Policies for Partially Observable Stochastic Domains , 1995, IJCAI.

[39]  Leslie Pack Kaelbling,et al.  Learning Policies for Partially Observable Environments: Scaling Up , 1997, ICML.

[40]  Illah R. Nourbakhsh,et al.  DERVISH - An Office-Navigating Robot , 1995, AI Mag..

[41]  Mosur Ravishankar,et al.  Efficient Algorithms for Speech Recognition. , 1996 .

[42]  Andrew W. Moore,et al.  Reinforcement Learning: A Survey , 1996, J. Artif. Intell. Res..

[43]  Scott Davies,et al.  Multidimensional Triangulation and Interpolation for Reinforcement Learning , 1996, NIPS.

[44]  Andrew McCallum,et al.  Reinforcement learning with selective perception and hidden state , 1996 .

[45]  Wolfram Burgard,et al.  Estimating the Absolute Position of a Mobile Robot Using Position Probability Grids , 1996, AAAI/IAAI, Vol. 2.

[46]  Lydia E. Kavraki,et al.  Probabilistic roadmaps for path planning in high-dimensional configuration spaces , 1996, IEEE Trans. Robotics Autom..

[47]  Liqiang Feng,et al.  Navigating Mobile Robots: Systems and Techniques , 1996 .

[48]  Craig Boutilier,et al.  Computing Optimal Policies for Partially Observable Decision Processes Using Compact Representations , 1996, AAAI/IAAI, Vol. 2.

[49]  Leslie Pack Kaelbling,et al.  Acting under uncertainty: discrete Bayesian models for mobile-robot navigation , 1996, Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems. IROS '96.

[50]  Gregory Dudek,et al.  Vision-based robot localization without explicit object models , 1996, Proceedings of IEEE International Conference on Robotics and Automation.

[51]  Michael L. Littman,et al.  Algorithms for Sequential Decision Making , 1996 .

[52]  Yasuhisa Niimi,et al.  A dialog control strategy based on the reliability of speech recognition , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[53]  Alexander H. Waibel,et al.  Dialogue strategies guiding users to their communicative goals , 1997, EUROSPEECH.

[54]  Richard Washington,et al.  BI-POMDP: Bounded, Incremental, Partially-Observable Markov-Model Planning , 1997, ECP.

[55]  B. S. Manjunath,et al.  An Eigenspace Update Algorithm for Image Analysis , 1997, CVGIP Graph. Model. Image Process..

[56]  Michael L. Littman,et al.  Incremental Pruning: A Simple, Fast, Exact Method for Partially Observable Markov Decision Processes , 1997, UAI.

[57]  Ronen I. Brafman,et al.  A Heuristic Variable Grid Solution Method for POMDPs , 1997, AAAI/IAAI.

[58]  Sam T. Roweis,et al.  EM Algorithms for PCA and SPCA , 1997, NIPS.

[59]  Leslie Pack Kaelbling,et al.  Planning and Acting in Partially Observable Stochastic Domains , 1998, Artif. Intell..

[60]  Wolfram Burgard,et al.  A Probabilistic Approach to Concurrent Mapping and Localization for Mobile Robots , 1998, Auton. Robots.

[61]  Wolfram Burgard,et al.  Position Estimation for Mobile Robots in Dynamic Environments , 1998, AAAI/IAAI.

[62]  Christopher M. Bishop,et al.  GTM: The Generative Topographic Mapping , 1998, Neural Computation.

[63]  Wolfram Burgard,et al.  An experimental comparison of localization methods , 1998, Proceedings. 1998 IEEE/RSJ International Conference on Intelligent Robots and Systems. Innovations in Theory, Practice and Applications (Cat. No.98CH36190).

[64]  Wolfram Burgard,et al.  Active Markov localization for mobile robots , 1998, Robotics Auton. Syst..

[65]  Gregory Dudek,et al.  Mobile robot localization from learned landmarks , 1998, Proceedings. 1998 IEEE/RSJ International Conference on Intelligent Robots and Systems. Innovations in Theory, Practice and Applications (Cat. No.98CH36190).

[66]  Xavier Boyen,et al.  Tractable Inference for Complex Stochastic Processes , 1998, UAI.

[67]  Eric A. Hansen,et al.  Solving POMDPs by Searching in Policy Space , 1998, UAI.

[68]  Andrew W. Moore,et al.  Gradient Descent for General Reinforcement Learning , 1998, NIPS.

[69]  A. Cassandra,et al.  Exact and approximate algorithms for partially observable markov decision processes , 1998 .

[70]  Hermann Ney,et al.  Evaluating dialog systems used in the real world , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[71]  Roberto Pieraccini,et al.  Using Markov decision process for learning dialogue strategies , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[72]  Alexander H. Waibel,et al.  Towards spontaneous speech recognition for on-board car navigation and information systems , 1999, EUROSPEECH.

[73]  Andrew W. Moore,et al.  Variable Resolution Discretization for High-Accuracy Solutions of Optimal Control Problems , 1999, IJCAI.

[74]  Kee-Eung Kim,et al.  Solving POMDPs by Searching the Space of Finite Policies , 1999, UAI.

[75]  Antal van den Bosch Instance-Family Abstraction in Memory-Based Language Learning , 1999, ICML.

[76]  H. Sebastian Seung,et al.  Learning the parts of objects by non-negative matrix factorization , 1999, Nature.

[77]  Wolfram Burgard,et al.  Coastal navigation-mobile robot navigation with uncertainty in dynamic environments , 1999, Proceedings 1999 IEEE International Conference on Robotics and Automation (Cat. No.99CH36288C).

[78]  Yishay Mansour,et al.  Approximate Planning in Large POMDPs via Reusable Trajectories , 1999, NIPS.

[79]  W. Burgard,et al.  Markov Localization for Mobile Robots in Dynamic Environments , 1999, J. Artif. Intell. Res..

[80]  Geoffrey J. Gordon,et al.  Approximate solutions to markov decision processes , 1999 .

[81]  Marilyn A. Walker,et al.  Reinforcement Learning for Spoken Dialogue Systems , 1999, NIPS.

[82]  Shimei Pan,et al.  Empirically Evaluating an Adaptable Spoken Dialogue System , 1999, ArXiv.

[83]  Andrew Y. Ng,et al.  Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping , 1999, ICML.

[84]  Sebastian Thrun,et al.  Coastal Navigation with Mobile Robots , 1999, NIPS.

[85]  Kee-Eung Kim,et al.  Learning Finite-State Controllers for Partially Observable Environments , 1999, UAI.

[86]  Sebastian Thrun,et al.  Monte Carlo POMDPs , 1999, NIPS.

[87]  Wolfram Burgard,et al.  Experiences with an Interactive Museum Tour-Guide Robot , 1999, Artif. Intell..

[88]  Kurt Konolige,et al.  A gradient method for realtime robot control , 2000, Proceedings. 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2000) (Cat. No.00CH37113).

[89]  Clark F. Olson,et al.  Probabilistic self-localization for mobile robots , 2000, IEEE Trans. Robotics Autom..

[90]  J. Tenenbaum,et al.  A global geometric framework for nonlinear dimensionality reduction. , 2000, Science.

[91]  Wolfram Burgard,et al.  Probabilistic Algorithms and the Interactive Museum Tour-Guide Robot Minerva , 2000, Int. J. Robotics Res..

[92]  Marilyn A. Walker,et al.  Automatic Optimization of Dialogue Management , 2000, COLING.

[93]  Thomas G. Dietterich Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition , 1999, J. Artif. Intell. Res..

[94]  Milos Hauskrecht,et al.  Value-Function Approximations for Partially Observable Markov Decision Processes , 2000, J. Artif. Intell. Res..

[95]  Alex Pentland,et al.  EM for Perceptual Coding and Reinforcement Learning Tasks , 2000 .

[96]  Ben M. Chen  Robust and H∞ control , 2000 .

[97]  S T Roweis,et al.  Nonlinear dimensionality reduction by locally linear embedding. , 2000, Science.

[98]  Peter L. Bartlett,et al.  Reinforcement Learning in POMDP's via Direct Gradient Ascent , 2000, ICML.

[99]  Michael I. Jordan,et al.  PEGASUS: A policy search method for large MDPs and POMDPs , 2000, UAI.

[100]  Zhengzhu Feng,et al.  Dynamic Programming for POMDPs Using a Factored State Representation , 2000, AIPS.

[101]  Weihong Zhang,et al.  Speeding Up the Convergence of Value Iteration in Partially Observable Markov Decision Processes , 2001, J. Artif. Intell. Res..

[102]  Eric A. Hansen,et al.  An Improved Grid-Based Approximation Algorithm for POMDPs , 2001, IJCAI.

[103]  Sanjoy Dasgupta,et al.  A Generalization of Principal Components Analysis to the Exponential Family , 2001, NIPS.

[104]  Geoffrey E. Hinton,et al.  Global Coordination of Local Linear Models , 2001, NIPS.

[105]  Jeff G. Schneider,et al.  Autonomous helicopter control using reinforcement learning policy search methods , 2001, Proceedings 2001 ICRA. IEEE International Conference on Robotics and Automation (Cat. No.01CH37164).

[106]  Sham M. Kakade,et al.  A Natural Policy Gradient , 2001, NIPS.

[107]  Wolfram Burgard,et al.  Robust Monte Carlo localization for mobile robots , 2001, Artif. Intell..

[108]  Bin Yu,et al.  Model Selection and the Principle of Minimum Description Length , 2001 .

[109]  S. LaValle,et al.  Randomized Kinodynamic Planning , 2001 .

[110]  Mukund Balasubramanian,et al.  The isomap algorithm and topological stability. , 2002, Science.

[111]  L. P. Kaelbling,et al.  Learning Geometrically-Constrained Hidden Markov Models for Robot Navigation: Bridging the Topological-Geometrical Gap , 2002, J. Artif. Intell. Res..

[112]  Geoffrey E. Hinton,et al.  Stochastic Neighbor Embedding , 2002, NIPS.

[113]  David Andre,et al.  State abstraction for programmable reinforcement learning agents , 2002, AAAI/IAAI.

[114]  I. Jolliffe Principal Component Analysis , 2002 .

[115]  Geoffrey J. Gordon  Generalized² Linear² Models , 2002, NIPS.

[116]  Leslie Pack Kaelbling,et al.  Learning Geometrically-Constrained Hidden Markov Models for Robot Navigation: Bridging the Geometrical-Topological Gap , 2002 .

[117]  Nicholas Roy,et al.  Exponential Family PCA for Belief Compression in POMDPs , 2002, NIPS.

[118]  William Whittaker,et al.  Conditional particle filters for simultaneous mobile robot localization and people-tracking , 2002, Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No.02CH37292).

[119]  Dieter Fox,et al.  An experimental comparison of localization methods continued , 2002, IEEE/RSJ International Conference on Intelligent Robots and Systems.

[120]  Craig Boutilier,et al.  Value-Directed Compression of POMDPs , 2002, NIPS.

[121]  Sebastian Thrun,et al.  Motion planning through policy search , 2002, IEEE/RSJ International Conference on Intelligent Robots and Systems.

[122]  Sridhar Mahadevan,et al.  Hierarchical learning and planning in partially observable markov decision processes , 2002 .

[123]  Joelle Pineau,et al.  Policy-contingent abstraction for robust robot control , 2002, UAI.

[124]  Joelle Pineau,et al.  Point-based value iteration: An anytime algorithm for POMDPs , 2003, IJCAI.

[125]  Anne Condon,et al.  On the undecidability of probabilistic planning and related stochastic optimization problems , 2003, Artif. Intell..

[126]  Andrew W. Moore,et al.  The Parti-game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-spaces , 1993, Machine Learning.

[127]  Teuvo Kohonen,et al.  Self-organized formation of topologically correct feature maps , 1982, Biological Cybernetics.

[128]  Ronald J. Williams,et al.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 1992, Machine Learning.

[129]  Michael Isard,et al.  CONDENSATION—Conditional Density Propagation for Visual Tracking , 1998, International Journal of Computer Vision.

[130]  Andrew W. Moore,et al.  Variable Resolution Discretization in Optimal Control , 2002, Machine Learning.

[131]  R. Simmons,et al.  The effect of representation and knowledge on goal-directed exploration with reinforcement-learning algorithms , 1996, Machine Learning.

[132]  Yoram Singer,et al.  The Hierarchical Hidden Markov Model: Analysis and Applications , 1998, Machine Learning.

[133]  M. E. Galassi,et al.  GNU Scientific Library Reference Manual , 2005 .

[134]  Nicholas Roy,et al.  Finding approximate POMDP solutions through belief compression , 2005 .

[135]  George E. Monahan,et al.  A Survey of Partially Observable Markov Decision Processes: Theory, Models, and Algorithms , 1982 .

[136]  Gene H. Golub,et al.  Singular value decomposition and least squares solutions , 1970, Milestones in Matrix Computation.