Closing the learning-planning loop with predictive state representations

A central problem in artificial intelligence is to choose actions to maximize reward in a partially observable, uncertain environment. To do so, we must learn an accurate environment model and then plan with it to maximize reward. Unfortunately, learning algorithms often recover a model that is too inaccurate to support planning or too large and complex for planning to succeed; or they require excessive prior domain knowledge or fail to provide guarantees such as statistical consistency. To address this gap, we propose a novel algorithm that provably learns a compact, accurate model directly from sequences of action-observation pairs. We then evaluate the learner by closing the loop from observations to actions. Specifically, we present a spectral algorithm for learning a predictive state representation (PSR) and evaluate it on a simulated, vision-based mobile-robot planning task, showing that the learned PSR captures the essential features of the environment and enables successful, efficient planning. Our algorithm has several benefits that have not appeared together in any previous PSR learner: it is computationally efficient and statistically consistent; it handles high-dimensional observations and long time horizons; and our close-the-loop experiments provide an end-to-end practical test.
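
To make the learning step concrete, the sketch below shows one way the spectral PSR estimates can be assembled in NumPy. It is a minimal sketch, not the paper's reference implementation: it assumes precomputed empirical statistics over a chosen set of tests and histories, namely a vector P_T of test probabilities from the initial (or stationary) distribution, a vector P_H of history probabilities, a matrix P_TH of joint test/history probabilities, and one matrix P_TaoH[(a, o)] per action-observation pair. The variable names, the rank parameter k, and the function boundaries are illustrative assumptions.

```python
import numpy as np

def learn_psr(P_T, P_H, P_TH, P_TaoH, k):
    """Spectral PSR learning (minimal sketch, assumed inputs).

    P_T    : (|T|,)      test probabilities from the initial/stationary distribution
    P_H    : (|H|,)      history probabilities
    P_TH   : (|T|, |H|)  joint test/history probabilities
    P_TaoH : dict mapping each (action, observation) pair to a (|T|, |H|)
             matrix of joint probabilities of {test after (a, o), history}
    k      : rank of the learned model (dimension of the predictive state)
    """
    # Top-k left singular vectors of the test/history matrix define a
    # low-dimensional predictive state space.
    U, _, _ = np.linalg.svd(P_TH, full_matrices=False)
    U = U[:, :k]

    # Pseudoinverse shared by the regressions below.
    pinv = np.linalg.pinv(U.T @ P_TH)                     # (|H|, k)

    b1 = U.T @ P_T                                        # initial predictive state, (k,)
    binf = pinv.T @ P_H                                   # normalization vector, (k,)
    B = {ao: U.T @ P_TaoH[ao] @ pinv for ao in P_TaoH}    # one (k, k) operator per (a, o)
    return b1, binf, B

def filter_step(b, binf, B, action, observation):
    """Bayes-filter update of the predictive state after executing `action`
    and receiving `observation`."""
    v = B[(action, observation)] @ b
    return v / (binf @ v)
```

The design follows the usual spectral recipe: a truncated SVD of the test/history matrix fixes the state space, and the initial state, normalization vector, and observable operators are then recovered by linear regression via the pseudoinverse, avoiding the local optima that EM-style learners are prone to. At run time, the learned state is maintained by repeatedly applying the filter update above as actions are executed and observations arrive.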
