Performance Guarantees for Model-Based Approximate Dynamic Programming in Continuous Spaces

We study both the value function and Q-function formulations of the linear programming approach to approximate dynamic programming. The approach is model-based and optimizes over a restricted function space to approximate the value function or Q-function. Working in the discrete-time, continuous-space setting, we provide guarantees on the fitting error and on the online performance of the resulting policy. In particular, the online performance guarantee is obtained by analyzing an iterated version of the greedy policy, and the fitting error guarantee by analyzing an iterated version of the Bellman inequality. These guarantees complement the existing bounds in the literature. The Q-function formulation offers benefits, for example in decentralized controller design; however, it can lead to computationally demanding optimization problems. To alleviate this drawback, we provide a condition under which the formulation simplifies, resulting in improved computation times.
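
To make the objects mentioned above concrete, the following is a minimal sketch of the value function LP and the iterated Bellman inequality, assuming a discounted infinite-horizon cost-minimization setting; the notation (stage cost $\ell$, discount factor $\gamma$, Bellman operator $\mathcal{T}$, state-relevance measure $c$, basis functions $\phi_i$, iteration depth $M$) is illustrative, and the precise formulation in the paper may differ.

% Exact LP characterization: V^* is the pointwise-largest feasible point of
\begin{align*}
  \underset{V}{\text{maximize}} \quad & \int_{X} V(x)\, c(\mathrm{d}x) \\
  \text{subject to} \quad & V(x) \le (\mathcal{T} V)(x) \quad \text{for all } x \in X, \\
  \text{where} \quad & (\mathcal{T} V)(x) = \min_{u \in U} \Big( \ell(x,u) + \gamma\, \mathbb{E}\big[ V(x^{+}) \mid x, u \big] \Big).
\end{align*}
% Approximate LP: restrict V to a finite-dimensional class,
%   V(x) = \sum_{i=1}^{K} \alpha_i \phi_i(x),
% and optimize over the coefficients \alpha_i; any feasible V is a pointwise
% lower bound on V^*.
% Iterated Bellman inequality: a relaxation of V <= T V built from auxiliary
% functions V_1, ..., V_{M-1} drawn from the same class,
\begin{align*}
  V \le \mathcal{T} V_1, \quad V_1 \le \mathcal{T} V_2, \quad \ldots, \quad V_{M-1} \le \mathcal{T} V
  \;\;\Longrightarrow\;\; V \le \mathcal{T} V_1 \le \mathcal{T}^{2} V_2 \le \cdots \le \mathcal{T}^{M} V,
\end{align*}
% which, by monotonicity and contraction of T, still implies V <= V^* while
% enlarging the feasible set of the LP.
% Greedy policy used for the online performance analysis:
\[
  \pi(x) \in \underset{u \in U}{\arg\min} \Big( \ell(x,u) + \gamma\, \mathbb{E}\big[ V(x^{+}) \mid x, u \big] \Big),
\]
% or, in the Q-function formulation, simply \pi(x) \in \arg\min_{u} Q(x,u),
% which avoids evaluating the expectation (and hence the model) online.

In this sketch, the appeal of the iterated inequality is that it enlarges the feasible set of the approximate LP without losing the lower-bound property, which is what allows tighter fits within the restricted function class.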
