Kalman Temporal Differences

Because reinforcement learning suffers from a lack of scalability, online value (and Q-) function approximation has received increasing interest over the last decade. This contribution introduces a novel approximation scheme, the Kalman Temporal Differences (KTD) framework, which exhibits the following features: sample efficiency, non-linear approximation, non-stationarity handling, and uncertainty management. A first KTD-based algorithm is provided for deterministic Markov Decision Processes (MDPs); it produces biased estimates in the case of stochastic transitions. Then the eXtended KTD framework (XKTD), which handles stochastic MDPs, is described. Convergence is analyzed in special cases, for both deterministic and stochastic transitions. The related algorithms are evaluated on classical benchmarks. They compare favorably to the state of the art while exhibiting the announced features.
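To make the idea concrete, the sketch below illustrates a KTD-style value-function update in the simplest setting: the value-function parameters are treated as the hidden state of a Kalman filter with a random-walk evolution model, and each observed reward is related to the parameters through the temporal-difference observation equation r_k ≈ V_θ(s_k) - γ V_θ(s_{k+1}). This is a minimal sketch assuming a linear parameterization V_θ(s) = θᵀφ(s); the class name, feature map, and noise variances are illustrative choices, and the full framework also handles non-linear parameterizations (via sigma-point approximations) and the XKTD extension for stochastic transitions.

```python
import numpy as np

class LinearKTDV:
    """Kalman-filter-style value estimation with a linear parameterization (illustrative sketch)."""

    def __init__(self, n_features, gamma=0.95, evolution_var=1e-3,
                 observation_var=1.0, prior_var=10.0):
        self.gamma = gamma
        self.theta = np.zeros(n_features)             # parameter estimate (hidden-state mean)
        self.P = prior_var * np.eye(n_features)       # parameter covariance (uncertainty)
        self.Q = evolution_var * np.eye(n_features)   # random-walk evolution noise covariance
        self.R = observation_var                      # observation (reward) noise variance

    def update(self, phi_s, phi_s_next, reward, done=False):
        # Prediction step: random-walk evolution model theta_k = theta_{k-1} + v_k.
        P_pred = self.P + self.Q

        # Observation model: r_k ~ V(s_k) - gamma * V(s_{k+1}), i.e. a TD-style "observation row".
        h = phi_s - (0.0 if done else self.gamma) * phi_s_next

        # Correction step: standard Kalman gain and update on the TD innovation.
        innovation = reward - h @ self.theta
        innovation_var = h @ P_pred @ h + self.R      # scalar innovation variance
        gain = P_pred @ h / innovation_var
        self.theta = self.theta + gain * innovation
        self.P = P_pred - np.outer(gain, h) @ P_pred
        return innovation

    def value(self, phi_s):
        return self.theta @ phi_s
```

The parameter covariance P is what provides the uncertainty information mentioned in the abstract: it can, for instance, drive exploration or adapt to non-stationarity through the evolution noise Q.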
