Feature Selection for Neuro-Dynamic Programming

Abstract

Neuro-Dynamic Programming encompasses techniques from both reinforcement learning and approximate dynamic programming. Feature selection refers to the choice of basis that defines the function class that is required in the application of these techniques. This chapter reviews two popular approaches to neuro-dynamic programming, TD-learning and Q-learning. The main goal of the chapter is to demonstrate how insight from idealized models can be used as a guide for feature selection for these algorithms. Several approaches are surveyed, including fluid and diffusion models, and the application of idealized models arising from mean-field game approximations. The theory is illustrated with several examples.

Keywords: Optimal Control, Stochastic Control, Approximate Dynamic Programming, Reinforcement Learning

2000 AMS Subject Classification: 49L20, 93E20, 93E35, 60J10

1 Introduction

If you have taken a course that mentioned a Riccati equation, then you have already been exposed to approximate dynamic programming: No physical system is linear! The model is assumed to be linear, and the cost function is assumed quadratic, so that a closed-form expression for the optimal feedback law can be computed.

In many cases, a linear approximation is not easily justified. In particular, in applications from operations research or chemical engineering, the state space is usually constrained. To obtain an effective feedback law for control, we must find alternative approaches to approximation.

Techniques from approximate dynamic programming and reinforcement learning are designed to obtain such approximations [1, 18]. In particular, in TD-learning and related approaches, there is no attempt to approximate the system or cost function. Instead, the solution to a dynamic programming equation is approximated directly, within a prescribed finite-dimensional function class. A key determinant of the success of these techniques is the selection of this function class, also known as the basis.
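To make the Riccati remark concrete, the following is a minimal sketch (in Python with NumPy, not taken from the chapter) of the linear-quadratic computation: the discrete-time Riccati recursion is iterated to a fixed point and the optimal linear feedback gain is read off. The system and cost matrices are arbitrary illustrative choices.

```python
# Minimal LQR sketch: iterate the discrete-time Riccati recursion
#   P <- Q + A'PA - A'PB (R + B'PB)^{-1} B'PA
# and return the optimal feedback gain K, so that u = -K x.
# The matrices below are illustrative assumptions, not from the chapter.
import numpy as np

def lqr_gain(A, B, Q, R, iters=500):
    P = Q.copy()
    for _ in range(iters):
        BtP = B.T @ P
        K = np.linalg.solve(R + BtP @ B, BtP @ A)   # gain for the current P
        P = Q + A.T @ P @ (A - B @ K)               # Riccati update
    return K, P

# Example: a lightly damped two-dimensional system with quadratic cost.
A = np.array([[1.0, 0.1], [0.0, 0.98]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])
K, P = lqr_gain(A, B, Q, R)
print("optimal gain K =", K)
```

By contrast, the next sketch shows what TD-learning with a prescribed basis looks like in the simplest setting: the value function of a fixed policy is approximated as a linear combination of features, and the quality of the result hinges on the choice of the feature map. The birth-death chain, the quadratic basis, and the step-size rule below are assumptions made only for this illustration; they are not the chapter's algorithm or examples.

```python
# TD(0) with a linear basis: V(x) ~ theta . psi(x).
# Choosing psi is the "feature selection" problem discussed in this chapter.
import numpy as np

rng = np.random.default_rng(0)
n_states = 10                      # birth-death chain on {0, ..., 9}
cost = lambda x: float(x)          # per-step cost c(x) = x
beta = 0.95                        # discount factor

def step(x):
    """One transition of the chain under a fixed (uncontrolled) policy."""
    if x == 0:
        return rng.choice([0, 1])
    if x == n_states - 1:
        return rng.choice([n_states - 2, n_states - 1])
    return x + rng.choice([-1, 1])

def psi(x):
    """Feature (basis) vector: constant, linear, and quadratic terms."""
    return np.array([1.0, x, x * x])

theta = np.zeros(3)
x = 0
for k in range(1, 100_001):
    x_next = step(x)
    # Temporal-difference error, then a stochastic-approximation update.
    d = cost(x) + beta * theta @ psi(x_next) - theta @ psi(x)
    theta += (1.0 / k) * d * psi(x)
    x = x_next

print("fitted coefficients theta =", theta)
```

Replacing psi with a better-informed basis, for example one suggested by a fluid or diffusion model of the chain, is precisely the kind of feature selection this chapter is about.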

References

[1] C. Watkins. Learning from delayed rewards, 1989.

[2] Vivek S. Borkar et al. Optimal Control of Diffusion Processes, 1989.

[3] Lawrence M. Wein et al. Dynamic Scheduling of a Multiclass Make-to-Stock Queue, 2015, Oper. Res.

[4] Richard L. Tweedie et al. Markov Chains and Stochastic Stability, 1993, Communications and Control Engineering Series.

[5] Andrew G. Barto et al. Adaptive linear quadratic control using policy iteration, 1994, Proceedings of the 1994 American Control Conference (ACC '94).

[6] John N. Tsitsiklis et al. Neuro-Dynamic Programming, 1996, Encyclopedia of Machine Learning.

[7] Thomas G. Dietterich. What is machine learning?, 2020, Archives of Disease in Childhood.

[8] John N. Tsitsiklis et al. Analysis of temporal-difference learning with function approximation, 1996, NIPS 1996.

[9] John N. Tsitsiklis et al. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives, 1999, IEEE Trans. Autom. Control.

[10] Sean P. Meyn et al. The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning, 2000, SIAM J. Control Optim.

[11] Vivek S. Borkar et al. Convex Analytic Methods in Markov Decision Processes, 2002.

[12] Benjamin Van Roy et al. The Linear Programming Approach to Approximate Dynamic Programming, 2003, Oper. Res.

[13] Sean P. Meyn et al. Performance Evaluation and Policy Selection in Multiclass Networks, 2003, Discret. Event Dyn. Syst.

[14] Peter Dayan et al. Q-learning, 1992, Machine Learning.

[15] Peter Dayan et al. Technical Note: Q-Learning, 2004, Machine Learning.

[16] Benjamin Van Roy et al. An approximate dynamic programming approach to decentralized control of stochastic systems, 2006.

[17] Sean P. Meyn. Control Techniques for Complex Networks: Workload, 2007.

[18] Minyi Huang et al. Large-Population Cost-Coupled LQG Problems With Nonuniform Agents: Individual-Mass Behavior and Decentralized ε-Nash Equilibria, 2007, IEEE Transactions on Automatic Control.

[19] D. Bertsekas et al. Q-learning algorithms for optimal stopping based on least squares, 2007, European Control Conference (ECC).

[20] V. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint, 2008.

[21] Sean P. Meyn et al. Shannon meets Bellman: Feature based Markovian models for detection and optimization, 2008, 47th IEEE Conference on Decision and Control.

[22] Sean P. Meyn et al. Q-learning and Pontryagin's Minimum Principle, 2009, Proceedings of the 48th IEEE Conference on Decision and Control (CDC) held jointly with the 28th Chinese Control Conference.

[23] Sean P. Meyn et al. A Dynamic Newsboy Model for Optimal Reserve Management in Electricity Markets, 2009.

[24] Adam Wierman et al. Approximate dynamic programming using fluid and diffusion approximations with applications to power management, 2009, Proceedings of the 48th IEEE Conference on Decision and Control (CDC) held jointly with the 28th Chinese Control Conference.

[25] Vivek S. Borkar et al. A New Learning Algorithm for Optimal Stopping, 2009, Discret. Event Dyn. Syst.

[26] Frank L. Lewis et al. Adaptive optimal control for continuous-time linear systems based on policy iteration, 2009, Autom.

[27] Csaba Szepesvári et al. Algorithms for Reinforcement Learning, 2010, Synthesis Lectures on Artificial Intelligence and Machine Learning.