Closing the learning-planning loop with predictive state representations

A central problem in artificial intelligence is planning to maximize future reward under uncertainty in a partially observable environment. Models of such environments include Partially Observable Markov Decision Processes (POMDPs) [4] as well as their generalizations, Predictive State Representations (PSRs) [9] and Observable Operator Models (OOMs) [7]. POMDPs model the state of the world as a latent variable; in contrast, PSRs and OOMs represent state by tracking occurrence probabilities of a set of future events (called tests or characteristic events) conditioned on past events (called histories or indicative events). Unfortunately, exact planning algorithms such as value iteration [14] are intractable for most realistic POMDPs due to the curse of history and the curse of dimensionality [11]. However, PSRs and OOMs hold the promise of mitigating both of these curses. First, many successful approximate planning techniques designed to address these problems in POMDPs can easily be adapted to PSRs and OOMs [8, 6]. Second, PSRs and OOMs are often more compact than their corresponding POMDPs (i.e., they need fewer state dimensions), mitigating the curse of dimensionality. Finally, since tests and histories are observable quantities, it has been suggested that PSRs and OOMs should be easier to learn than POMDPs; with a successful learning algorithm, we can look for a model which ignores all but the most important components of state, reducing dimensionality still further.

In this paper we take an important step toward realizing the above hopes. In particular, we propose and demonstrate a fast and statistically consistent spectral algorithm which learns the parameters of a PSR directly from sequences of action-observation pairs. We then close the loop from observations to actions by planning in the learned model and recovering a policy which is near-optimal in the original environment. Closing the loop is a much more stringent test than simply checking short-term prediction accuracy, since the quality of an optimized policy depends strongly on the accuracy of the model: inaccurate models typically lead to useless plans.
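To make the learning step concrete, the following is a minimal sketch (in Python/NumPy) of the kind of spectral computation such an algorithm performs: estimate probabilities relating tests and histories from action-observation data, project onto a low-dimensional subspace via an SVD, and read off observable operators that support filtering and prediction. The matrix names (P_H, P_TH, P_T_ao_H), the toy data, and the rank choice below are illustrative assumptions, not the paper's exact notation or implementation.

```python
# Hypothetical sketch of spectral PSR learning; toy quantities stand in for
# empirical estimates that would normally be computed from action-observation
# sequences gathered under an exploratory policy.
import numpy as np

rng = np.random.default_rng(0)

n_tests, n_hists, k = 6, 6, 3          # numbers of tests/histories, latent rank
actions, observations = 2, 2           # small discrete action and observation sets

# Assumed empirical estimates (here random placeholders):
#   P_H[j]              ~ Pr[history j]
#   P_TH[i, j]          ~ Pr[test i, history j]
#   P_T_ao_H[(a,o)][i,j] ~ Pr[test i after taking a and seeing o, history j]
P_H = rng.dirichlet(np.ones(n_hists))
P_TH = rng.dirichlet(np.ones(n_tests * n_hists)).reshape(n_tests, n_hists)
P_T_ao_H = {(a, o): P_TH * rng.uniform(0.0, 1.0, size=P_TH.shape)
            for a in range(actions) for o in range(observations)}

# Spectral step: project tests onto the top-k left singular vectors of P_TH.
U, _, _ = np.linalg.svd(P_TH, full_matrices=False)
U = U[:, :k]

# Learned parameters in observable-operator form.
pinv = np.linalg.pinv(U.T @ P_TH)
b1 = U.T @ (P_TH @ np.ones(n_hists))   # initial predictive state (toy stand-in, up to normalization)
binf = pinv.T @ P_H                    # normalizer so predictions sum to one
B = {ao: U.T @ P_ao @ pinv for ao, P_ao in P_T_ao_H.items()}

def update(b_t, a, o):
    """Filter: update the predictive state after executing a and observing o."""
    b_next = B[(a, o)] @ b_t
    return b_next / (binf @ b_next)

def predict(b_t, a, o):
    """One-step prediction: Pr[o | do(a)] from the current predictive state."""
    return float(binf @ (B[(a, o)] @ b_t))

b = b1 / (binf @ b1)
print(predict(b, 0, 1))
b = update(b, 0, 1)
```

In the setting described above, the estimated matrices would come from data collected under an exploratory policy, and the learned model's predictive states would then be handed to an approximate (e.g., point-based) planner to recover a policy.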

[1] Yishay Mansour et al. Planning in POMDPs Using Multiplicity Automata, 2005, UAI.

[2] Satinder P. Singh et al. Exponential Family Predictive Representations of State, 2007, NIPS.

[3] Nikos A. Vlassis et al. Improving Approximate Value Iteration Using Memories and Predictive State Representations, 2006, AAAI.

[4] Nikos A. Vlassis et al. Perseus: Randomized Point-based Value Iteration for POMDPs, 2005, J. Artif. Intell. Res.

[5] Michael R. James et al. Learning and discovery of predictive state representations in dynamical systems with reset, 2004, ICML.

[6] Satinder P. Singh et al. On discovery and learning of models with predictive representations of state for agents with continuous actions and observations, 2007, AAMAS '07.

[7] Richard S. Sutton et al. Predictive Representations of State, 2001, NIPS.

[8] Michael R. James et al. Predictive State Representations: A New Theory for Modeling Dynamical Systems, 2004, UAI.

[9] Guy Shani et al. Model-Based Online Learning of POMDPs, 2005, ECML.

[10] Michael H. Bowling et al. Online Discovery and Learning of Predictive State Representations, 2005, NIPS.

[11] Byron Boots et al. Closing the learning-planning loop with predictive state representations, 2011, Int. J. Robotics Res.

[12] Byron Boots et al. Reduced-Rank Hidden Markov Models, 2009, AISTATS.

[13] Joelle Pineau et al. Point-based value iteration: An anytime algorithm for POMDPs, 2003, IJCAI.

[14] Michael R. James et al. Learning predictive state representations in dynamical systems without reset, 2005, ICML.

[15] Stefano Soatto et al. Dynamic Data Factorization, 2001.

[16] Joelle Pineau et al. Model-Based Bayesian Reinforcement Learning in Large Structured Domains, 2008, UAI.

[17] Sham M. Kakade et al. A spectral algorithm for learning Hidden Markov Models, 2008, J. Comput. Syst. Sci.

[18] Herbert Jaeger et al. Observable Operator Models for Discrete Stochastic Time Series, 2000, Neural Computation.

[19] Andrew McCallum et al. Reinforcement learning with selective perception and hidden state, 1996.

[20] Herbert Jaeger et al. A Bound on Modeling Error in Observable Operator Models and an Associated Learning Algorithm, 2009, Neural Computation.

[21] Satinder P. Singh et al. Efficiently learning linear-linear exponential family predictive representations of state, 2008, ICML '08.

[22] Leslie Pack Kaelbling et al. Acting Optimally in Partially Observable Stochastic Domains, 1994, AAAI.

[23] Bart De Moor et al. Subspace Identification for Linear Systems: Theory - Implementation - Applications, 2011.

[24] Peter Stone et al. Learning Predictive State Representations, 2003, ICML.

[25] Edward J. Sondik et al. The optimal control of partially observable Markov processes, 1971.

[26] Eric Wiewiora et al. Learning predictive representations from a history, 2005, ICML.

[27] Michael H. Bowling et al. Learning predictive state representations using non-blind policies, 2006, ICML '06.

[28] Jeff A. Bilmes et al. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, 1998.

[29] Nicholas K. Jong and Peter Stone. Towards Employing PSRs in a Continuous Domain, 2004.

[30] Joelle Pineau et al. Anytime Point-Based Approximations for Large POMDPs, 2006, J. Artif. Intell. Res.

[31] Sebastian Thrun et al. Learning low dimensional predictive representations, 2004, ICML.

[32] Doina Precup et al. Point-Based Planning for Predictive State Representations, 2008, Canadian Conference on AI.

[33] P. J. Green et al. Density Estimation for Statistics and Data Analysis, 1987.