Reinforcement Learning Algorithms in Markov Decision Processes
AAAI-10 Tutorial, Part II: Learning to Predict Values

• Uses importance sampling to convert the off-policy case into the on-policy case (a minimal sketch follows below)
• Convergence assured by the theorem of Tsitsiklis & Van Roy (1997)
• Survives the Bermuda triangle!

BUT!

• Variance can be high, even infinite (slow learning)
• Difficult to use with continuous or large action spaces
• Requires an explicit representation of the behavior policy (a probability distribution)

Option formalism

An option is defined as a triple o = 〈I, π, β〉, where
• I ⊆ S is the set of states in which the option can be initiated
• π is the internal policy of the option
• β : S → [0, 1] is a stochastic termination condition

We want to compute the reward model of option o:

E_o{R(s)} = E{r_1 + r_2 + … + r_T | s_0 = s, π, β},

where T is the (random) time step at which the option terminates according to β. A Monte-Carlo sketch of this estimate is given below.
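
As a concrete illustration of the first bullet above, here is a minimal sketch of importance-sampling-corrected TD(0) prediction with linear function approximation. It is not the full per-decision importance-sampling TD(λ) algorithm of Precup, Sutton & Dasgupta (2001), only the simplest one-step variant, and the names used (`transitions`, `target_policy`, `behavior_policy`, `features`) are hypothetical stand-ins, not part of the tutorial.

```python
import numpy as np

def is_corrected_td0(transitions, target_policy, behavior_policy,
                     features, n_features, alpha=0.01, gamma=0.99):
    """Estimate the target policy's value function from transitions
    generated by a different behavior policy.

    Each update is weighted by the importance-sampling ratio
    rho = pi(a|s) / b(a|s), which makes the expected update equal to
    the on-policy one.  This requires b(a|s) to be known explicitly
    and nonzero wherever pi(a|s) > 0.
    """
    w = np.zeros(n_features)                       # linear value-function weights
    for (s, a, r, s_next, done) in transitions:
        rho = target_policy(a, s) / behavior_policy(a, s)  # importance ratio
        x = features(s)
        x_next = features(s_next)
        v = w @ x
        v_next = 0.0 if done else w @ x_next
        delta = r + gamma * v_next - v             # TD error
        w += alpha * rho * delta * x               # IS-weighted semi-gradient step
    return w
```

The weighting by ρ = π(a|s)/b(a|s) is exactly what converts the off-policy expectation into an on-policy one, and it is also the source of the high (possibly infinite) variance and of the requirement, noted above, that the behavior policy be represented explicitly.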

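The reward model E_o{R(s)} can likewise be estimated by simple Monte-Carlo rollouts: execute the option's internal policy π from s, terminate each step with probability β(s), and average the accumulated reward. The sketch below assumes a hypothetical environment interface `env.step(s, a) -> (next_state, reward)` and represents an option as the triple (I, π, β) from the definition above.

```python
import random

def option_reward_model(env, s0, option, n_rollouts=1000, max_steps=1000):
    """Monte-Carlo estimate of the option's reward model
    E_o{R(s0)} = E{ r_1 + ... + r_T | s_0 = s0, pi, beta },
    where T is the step at which the option terminates."""
    I, pi, beta = option                # initiation set, internal policy, termination condition
    assert s0 in I, "the option cannot be initiated in s0"
    total = 0.0
    for _ in range(n_rollouts):
        s, ret = s0, 0.0
        for _ in range(max_steps):
            s, r = env.step(s, pi(s))       # follow the option's internal policy
            ret += r                        # accumulate (undiscounted) reward
            if random.random() < beta(s):   # stochastic termination test
                break
        total += ret
    return total / n_rollouts
```

If a discount factor γ is used, the rewards would instead be accumulated as γ^(t−1) · r_t; the equation above is written for the undiscounted case.
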
[1] Leemon C. Baird et al. Residual Algorithms: Reinforcement Learning with Function Approximation. ICML, 1995.

[2] Benjamin Van Roy et al. Average cost temporal-difference learning. Proceedings of the 36th IEEE Conference on Decision and Control, 1997.

[3] Doina Precup et al. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence, 1999.

[4] R. Sutton. Gain Adaptation Beats Least Squares. 2006.

[5] M. Kosorok. Introduction to Empirical Processes and Semiparametric Inference. 2008.

[6] John N. Tsitsiklis et al. Analysis of Temporal-Difference Learning with Function Approximation. NIPS, 1996.

[7] Warren B. Powell et al. Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Machine Learning, 2006.

[8] Sanjoy Dasgupta et al. Off-Policy Temporal Difference Learning with Function Approximation. ICML, 2001.

[9] Richard S. Sutton et al. A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation. NIPS, 2008.

[10] Richard S. Sutton et al. Temporal Abstraction in Temporal-difference Networks. NIPS, 2005.

[11] Richard S. Sutton et al. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. Artificial General Intelligence, 2010.

[12] Steven J. Bradtke et al. Linear Least-Squares algorithms for temporal difference learning. Machine Learning, 2004.

[13] Shalabh Bhatnagar et al. Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation. NIPS, 2009.

[14] Vladislav Tadic et al. On the Convergence of Temporal-Difference Learning with Linear Function Approximation. Machine Learning, 2001.

[15] Steven J. Bradtke et al. Incremental dynamic programming for on-line adaptive optimal control. 1995.

[16] Dimitri P. Bertsekas et al. Dynamic Programming and Optimal Control, Two Volume Set. 1995.

[17] Shalabh Bhatnagar et al. Fast gradient-descent methods for temporal-difference learning with linear function approximation. ICML, 2009.

[18] A. Shapiro. Monte Carlo Sampling Methods. 2003.

[19] Long Ji Lin et al. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 1992.