Dynamic policy programming

In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in infinite-horizon Markov decision processes. We prove finite-iteration and asymptotic $\ell_\infty$-norm performance-loss bounds for DPP in the presence of approximation/estimation error. The bounds are expressed in terms of the $\ell_\infty$-norm of the average accumulated error, as opposed to the $\ell_\infty$-norm of the error in the case of standard approximate value iteration (AVI) and approximate policy iteration (API). This suggests that DPP can achieve better performance than AVI and API, since it averages out the simulation noise caused by Monte-Carlo sampling throughout the learning process. We examine these theoretical results numerically by comparing the performance of approximate variants of DPP with existing reinforcement learning (RL) methods on different problem domains. Our results show that, in all cases, DPP-based algorithms outperform the other RL methods by a wide margin.
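The abstract does not spell out the DPP recursion itself, so a compact sketch may help fix ideas. The tabular update below follows the DPP operator described in the paper, which maintains action preferences Psi and replaces the hard max of value iteration with a Boltzmann-weighted average M_eta; the small random MDP, the inverse-temperature value eta, and all variable names here are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def dpp(P, r, gamma=0.95, eta=5.0, n_iter=500):
    """Tabular dynamic policy programming (sketch, assumed hyperparameters).

    P: (S, A, S) transition kernel, r: (S, A) rewards.
    Returns the Boltzmann policy induced by the final preferences Psi.
    """
    S, A = r.shape
    psi = np.zeros((S, A))                      # action preferences Psi_0
    for _ in range(n_iter):
        # Soft-max policy pi(a|s) ~ exp(eta * Psi(s, a)), max-shifted for stability.
        w = np.exp(eta * (psi - psi.max(axis=1, keepdims=True)))
        pi = w / w.sum(axis=1, keepdims=True)
        m = (pi * psi).sum(axis=1)              # M_eta Psi(s): Boltzmann average, shape (S,)
        # DPP update: Psi <- Psi + r + gamma * P (M_eta Psi) - M_eta Psi
        psi = psi + r + gamma * P @ m - m[:, None]
    w = np.exp(eta * (psi - psi.max(axis=1, keepdims=True)))
    return w / w.sum(axis=1, keepdims=True)

# Illustrative usage on a small random MDP.
rng = np.random.default_rng(0)
S, A = 5, 3
P = rng.dirichlet(np.ones(S), size=(S, A))      # random transition probabilities
r = rng.uniform(size=(S, A))
print(dpp(P, r).round(3))
```

In the approximate setting analyzed in the paper, the exact expectation (here P @ m) is replaced by sampled estimates; because each update adds to the running preferences rather than overwriting them, the sampling errors enter as an average over iterations, which is the mechanism behind the bounds on the $\ell_\infty$-norm of the average accumulated error.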
