Tracking in Reinforcement Learning

Reinforcement learning induces non-stationarity at several levels. Adaptation to non-stationary environments is, of course, a desired feature of any sound RL algorithm. Yet even when the agent's environment is stationary, generalized policy iteration schemes interleave learning and control, which makes the evaluated policy, and hence its value function, non-stationary. Tracking the optimal solution rather than trying to converge to it is therefore preferable. In this paper, we propose to handle this tracking issue with a Kalman-based temporal difference framework. Complexity and convergence are analyzed, and empirical evidence of the framework's ability to handle non-stationarity is provided.
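
To make the idea concrete, here is a minimal sketch of a Kalman-filter-style temporal difference update for policy evaluation with a linear value function V(s) = phi(s)^T theta. It is an illustration of the general principle only, not the paper's exact algorithm: the class name, the noise parameters (process_noise, obs_noise) and the feature map are assumptions introduced for this example. The key point is that a non-zero process noise lets the filter keep tracking a drifting value function instead of freezing on a converged estimate.

```python
import numpy as np

class LinearKalmanTD:
    """Sketch of Kalman-filter-based TD evaluation with linear features.

    Parameters are treated as a hidden state following a random walk,
    and each observed reward is a noisy linear observation of them.
    """

    def __init__(self, n_features, gamma=0.95, process_noise=1e-3, obs_noise=1.0):
        self.gamma = gamma
        self.theta = np.zeros(n_features)              # value-function weights
        self.P = np.eye(n_features)                    # parameter covariance
        self.Q = process_noise * np.eye(n_features)    # random-walk noise -> enables tracking
        self.R = obs_noise                             # observation (reward) noise variance

    def update(self, phi_s, phi_next, reward):
        # Prediction step: theta_t = theta_{t-1} + v_t (random-walk evolution model).
        P_pred = self.P + self.Q
        # Observation model: r_t = (phi(s_t) - gamma * phi(s_{t+1}))^T theta_t + n_t.
        h = phi_s - self.gamma * phi_next
        # Kalman gain and correction driven by the temporal-difference innovation.
        s = h @ P_pred @ h + self.R
        k = P_pred @ h / s
        innovation = reward - h @ self.theta
        self.theta = self.theta + k * innovation
        self.P = P_pred - np.outer(k, h @ P_pred)
        return innovation

    def value(self, phi_s):
        return phi_s @ self.theta
```

With process_noise set to zero this behaves like a recursive least-squares TD estimator that converges and stops adapting; a small positive value keeps the covariance from collapsing, so the estimate continues to track changes in the target value function.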
