Predictive State Temporal Difference Learning

We propose a new approach to value function approximation that combines linear temporal difference reinforcement learning with subspace identification. In practical applications, reinforcement learning (RL) is complicated by the fact that state is either high-dimensional or partially observable. Therefore, RL methods are designed to work with features of state rather than state itself, and the success or failure of learning is often determined by the suitability of the selected features. By comparison, subspace identification (SSID) methods are designed to select a feature set that preserves as much information as possible about state. In this paper we connect the two approaches, looking at the problem of reinforcement learning with a large set of features, each of which may be only marginally useful for value function approximation. We introduce a new algorithm for this situation, called Predictive State Temporal Difference (PSTD) learning. As in SSID for predictive state representations, PSTD finds a linear compression operator that projects a large set of features down to a small set that preserves the maximum amount of predictive information. As in RL, PSTD then uses a Bellman recursion to estimate a value function. We discuss the connection between PSTD and prior approaches in RL and SSID. We prove that PSTD is statistically consistent, perform several experiments that illustrate its properties, and demonstrate its potential on a difficult optimal stopping problem.
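To make the two-stage structure described above concrete, here is a minimal illustrative sketch, not the paper's exact algorithm: it assumes a hypothetical helper pstd_sketch, uses an SVD of the empirical cross-covariance between current and next-step features as a stand-in for the predictive compression step, and then runs a standard LSTD-style Bellman fixed-point solve in the compressed space.

```python
import numpy as np

def pstd_sketch(Phi, Phi_next, R, k, gamma=0.9, ridge=1e-6):
    """Illustrative sketch of the two stages described in the abstract.

    Phi:      T x d matrix of features at each time step
    Phi_next: T x d matrix of the same features one step later
    R:        length-T vector of immediate rewards
    k:        target dimension of the compressed feature space
    """
    # Stage 1 (assumed stand-in for the paper's SSID-style compression):
    # take the top-k left singular vectors of the cross-covariance between
    # current and next-step features as a rank-k compression operator U.
    C = Phi.T @ Phi_next / Phi.shape[0]
    U, _, _ = np.linalg.svd(C, full_matrices=False)
    U = U[:, :k]

    # Stage 2: LSTD-style temporal-difference fixed point in the
    # compressed feature space psi = phi @ U.
    Psi, Psi_next = Phi @ U, Phi_next @ U
    A = Psi.T @ (Psi - gamma * Psi_next) + ridge * np.eye(k)
    b = Psi.T @ R
    w = np.linalg.solve(A, b)   # value-function weights
    return U, w                 # V(x) is approximated by phi(x) @ U @ w
```

The parameter names, the SVD-based compression, and the ridge term are assumptions made for the sake of a runnable example; the paper's own construction of the compression operator and consistency guarantees should be consulted for the actual method.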
