Two Timescale Stochastic Approximation with Controlled Markov Noise

We present, for the first time, an asymptotic convergence analysis of two time-scale stochastic approximation driven by 'controlled' Markov noise. In particular, both the faster and the slower recursions have non-additive controlled Markov noise components in addition to martingale difference noise. We analyze the asymptotic behavior of this framework by relating it to limiting differential inclusions in both time-scales, which are defined in terms of the ergodic occupation measures associated with the controlled Markov processes. Finally, we use our results to present a solution to the off-policy convergence problem for temporal difference learning with linear function approximation.
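
For orientation, a coupled pair of recursions of the kind described above can be sketched as follows. The notation is an assumption made for illustration and is not taken from the text: x_n and y_n denote the faster and slower iterates, Z^{(1)}_n and Z^{(2)}_n the controlled Markov noise processes, M^{(1)}_{n+1} and M^{(2)}_{n+1} the martingale difference terms, and a(n), b(n) the step sizes. The step-size conditions shown are the standard ones for two time-scale schemes and may differ from the exact assumptions of the analysis.

```latex
% Illustrative sketch only; all symbols are assumed notation.
\begin{align*}
  x_{n+1} &= x_n + a(n)\left[ h\bigl(x_n, y_n, Z^{(1)}_n\bigr) + M^{(1)}_{n+1} \right]
    && \text{(faster recursion)} \\
  y_{n+1} &= y_n + b(n)\left[ g\bigl(x_n, y_n, Z^{(2)}_n\bigr) + M^{(2)}_{n+1} \right]
    && \text{(slower recursion)}
\end{align*}
% Standard two time-scale step-size conditions (assumed):
\[
  \sum_n a(n) = \sum_n b(n) = \infty, \qquad
  \sum_n \bigl( a(n)^2 + b(n)^2 \bigr) < \infty, \qquad
  \frac{b(n)}{a(n)} \to 0 .
\]
```

In this sketch the Markov noise is "non-additive" because Z^{(1)}_n and Z^{(2)}_n enter as arguments of h and g rather than as separate additive terms, and the condition b(n)/a(n) -> 0 is what makes the y-recursion evolve on the slower time-scale.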
