Paper information - An Empirical Dynamic Programming Algorithm for Continuous MDPs

We propose universal randomized function approximation-based empirical value iteration (EVI) algorithms for Markov decision processes. The 'empirical' nature comes from performing each iteration with samples of the next state obtained from simulation, which makes the Bellman operator a random operator. EVI is then combined with two function approximation methods: a parametric method that uses a parametric function space, and a non-parametric method that uses a reproducing kernel Hilbert space (RKHS). Both function spaces have the universal function approximation property, and the basis functions are picked randomly. The convergence analysis uses a random operator framework together with techniques from the theory of stochastic dominance. Finite-time sample complexity bounds are derived for both universal approximate dynamic programming algorithms. Numerical experiments support the versatility and effectiveness of this approach.
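The idea in the abstract can be illustrated with a minimal sketch: an empirical Bellman update computed from simulated next states, followed by a fit onto randomly drawn basis functions. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy dynamics `simulate`, the stage cost `cost`, the cosine features, the cost-minimization convention, and the ridge least-squares fit. It is not the authors' algorithm or their experimental setup.

```python
import numpy as np

def random_features(states, w, b):
    """Randomly drawn cosine basis functions phi_j(s) = cos(w_j * s + b_j)."""
    return np.cos(np.outer(states, w) + b)

def simulate(s, a, n, rng):
    """Hypothetical toy dynamics on [0, 1]: drift by action a plus Gaussian noise."""
    return np.clip(s + a + 0.05 * rng.standard_normal(n), 0.0, 1.0)

def cost(s, a):
    """Hypothetical stage cost: squared distance from 0.5 plus an action penalty."""
    return (s - 0.5) ** 2 + 0.1 * abs(a)

def empirical_value_iteration(n_iters=30, n_states=200, n_next=20,
                              n_features=20, gamma=0.9, ridge=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    actions = np.array([-0.1, 0.0, 0.1])
    # Basis parameters are sampled once at random, not optimized.
    w = rng.normal(0.0, 3.0, n_features)
    b = rng.uniform(0.0, 2.0 * np.pi, n_features)
    theta = np.zeros(n_features)  # value estimate starts at zero
    for _ in range(n_iters):
        states = rng.uniform(0.0, 1.0, n_states)
        targets = np.empty(n_states)
        for i, s in enumerate(states):
            q_values = []
            for a in actions:
                # Empirical Bellman update: the expectation over the next
                # state is replaced by an average over simulated samples.
                next_states = simulate(s, a, n_next, rng)
                v_next = random_features(next_states, w, b) @ theta
                q_values.append(cost(s, a) + gamma * v_next.mean())
            targets[i] = min(q_values)
        # Fit the targets with ridge-regularized least squares (an assumption).
        phi = random_features(states, w, b)
        gram = phi.T @ phi + ridge * np.eye(n_features)
        theta = np.linalg.solve(gram, phi.T @ targets)
    return lambda s: random_features(np.atleast_1d(s), w, b) @ theta

value_fn = empirical_value_iteration()
print(value_fn(np.array([0.0, 0.5, 1.0])))
```

In this toy problem the estimated cost-to-go should be lowest near the state 0.5 and higher toward the boundaries. Because both the sampled states and the simulated next states are redrawn at every iteration, each update is a random operator applied to the previous estimate, which is the feature the abstract highlights.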

Rahul Jain | W. Haskell | Hiteshi Sharma | Pengqian Yu
