Dynamic policy programming

In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in infinite-horizon Markov decision processes. We prove finite-iteration and asymptotic $\ell_\infty$-norm performance-loss bounds for DPP in the presence of approximation/estimation error. The bounds are expressed in terms of the $\ell_\infty$-norm of the average accumulated error, as opposed to the $\ell_\infty$-norm of the error in the case of standard approximate value iteration (AVI) and approximate policy iteration (API). This suggests that DPP can achieve better performance than AVI and API, since it averages out the simulation noise caused by Monte-Carlo sampling throughout the learning process. We examine these theoretical results numerically by comparing the performance of approximate variants of DPP with existing reinforcement learning (RL) methods on different problem domains. Our results show that, in all cases, DPP-based algorithms outperform the other RL methods by a wide margin.
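The abstract does not spell out the DPP recursion itself, so a compact sketch may help fix ideas. The tabular update below follows the DPP operator described in the paper, which maintains action preferences Psi and replaces the hard max of value iteration with a Boltzmann-weighted average M_eta; the small random MDP, the inverse-temperature value eta, and all variable names here are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def dpp(P, r, gamma=0.95, eta=5.0, n_iter=500):
    """Tabular dynamic policy programming (sketch, assumed hyperparameters).

    P: (S, A, S) transition kernel, r: (S, A) rewards.
    Returns the Boltzmann policy induced by the final preferences Psi.
    """
    S, A = r.shape
    psi = np.zeros((S, A))                      # action preferences Psi_0
    for _ in range(n_iter):
        # Soft-max policy pi(a|s) ~ exp(eta * Psi(s, a)), max-shifted for stability.
        w = np.exp(eta * (psi - psi.max(axis=1, keepdims=True)))
        pi = w / w.sum(axis=1, keepdims=True)
        m = (pi * psi).sum(axis=1)              # M_eta Psi(s): Boltzmann average, shape (S,)
        # DPP update: Psi <- Psi + r + gamma * P (M_eta Psi) - M_eta Psi
        psi = psi + r + gamma * P @ m - m[:, None]
    w = np.exp(eta * (psi - psi.max(axis=1, keepdims=True)))
    return w / w.sum(axis=1, keepdims=True)

# Illustrative usage on a small random MDP.
rng = np.random.default_rng(0)
S, A = 5, 3
P = rng.dirichlet(np.ones(S), size=(S, A))      # random transition probabilities
r = rng.uniform(size=(S, A))
print(dpp(P, r).round(3))
```

In the approximate setting analyzed in the paper, the exact expectation (here P @ m) is replaced by sampled estimates; because each update adds to the running preferences rather than overwriting them, the sampling errors enter as an average over iterations, which is the mechanism behind the bounds on the $\ell_\infty$-norm of the average accumulated error.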
