TD-learning with exploration

We introduce exploration into the TD-learning algorithm for approximating the value function of a given policy. Exploration modifies the norm in which the approximation error is measured, allowing the algorithm to “zoom in” on a region of interest in the state space. We also provide extensions to SARSA that eliminate the need for numerical integration in policy improvement. The construction of the algorithm and its analysis build on recent general results from the spectral theory of Markov chains and positive operators.
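
The following is a minimal sketch, not the paper's construction: it runs standard TD(0) with linear function approximation on a toy birth-death chain, once under the nominal policy and once with an exploratory perturbation, so that the two runs visit the state space with different frequencies and hence approximate the value function in different weighted norms. All names, features, and parameters (N, GAMMA, ALPHA, EPS) are illustrative assumptions, not taken from the paper.

```python
# Sketch only: TD(0) with linear features for a fixed policy, with and
# without an exploratory behavior policy.  Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)

N = 20        # number of states 0..N-1 (assumed toy chain)
GAMMA = 0.95  # discount factor
ALPHA = 0.05  # step size
EPS = 0.3     # probability of taking the exploratory action


def features(x):
    """Polynomial features of the normalized state."""
    s = x / (N - 1)
    return np.array([1.0, s, s * s])


def reward(x):
    """Illustrative one-step reward: cost grows with the state."""
    return -float(x)


def step(x, explore):
    """Birth-death dynamics: the nominal policy drifts toward state 0;
    with probability EPS the exploratory policy pushes upward instead,
    concentrating visits on the upper part of the state space."""
    p_up = 0.8 if (explore and rng.random() < EPS) else 0.3
    return min(x + 1, N - 1) if rng.random() < p_up else max(x - 1, 0)


def td0(explore, n_steps=50_000):
    """Run TD(0) along a single trajectory and return the weight vector."""
    theta = np.zeros(3)
    x = N // 2
    for _ in range(n_steps):
        x_next = step(x, explore)
        phi, phi_next = features(x), features(x_next)
        # Temporal-difference error and stochastic-approximation update.
        d = reward(x) + GAMMA * phi_next @ theta - phi @ theta
        theta += ALPHA * d * phi
        x = x_next
    return theta


if __name__ == "__main__":
    print("theta without exploration:", td0(explore=False))
    print("theta with exploration:   ", td0(explore=True))
```

Comparing the two coefficient vectors illustrates how the choice of exploratory behavior shifts the state-visitation weighting, which is the sense in which exploration “zooms in” on a region of the state space.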
