Model-Free Monte Carlo-like Policy Evaluation

We propose an algorithm for estimating the finite-horizon expected return of a closed-loop control policy from an a priori given (off-policy) sample of one-step transitions. The algorithm averages the cumulated rewards along a set of "broken trajectories" made of one-step transitions selected from the sample on the basis of the control policy. Under Lipschitz continuity assumptions on the system dynamics, the reward function, and the control policy, we provide bounds on the bias and variance of the estimator that depend only on the Lipschitz constants, on the number of broken trajectories used by the estimator, and on the sparsity of the sample of one-step transitions.
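To make the estimator concrete, here is a minimal sketch of one way a broken-trajectory estimator of this kind could be implemented, assuming a deterministic policy `pi`, a finite horizon `T`, and a user-supplied distance over state-action pairs. The function name `mfmc_estimate` and all parameter names are illustrative choices, not taken from the paper.

```python
def mfmc_estimate(transitions, pi, x0, T, p, distance):
    """Estimate the T-step return of policy pi from a sample of one-step transitions.

    transitions : list of one-step transitions (x, u, r, y)
    pi          : policy, maps a state to an action
    x0          : initial state
    T           : horizon (number of steps per broken trajectory)
    p           : number of broken trajectories to average over
    distance    : metric on (state, action) pairs, e.g. a weighted norm
    """
    assert len(transitions) >= p * T, "sample too small to build p disjoint trajectories"
    available = list(transitions)          # each transition is used at most once
    returns = []
    for _ in range(p):
        x, total = x0, 0.0
        for _ in range(T):
            u = pi(x)
            # select the unused transition whose (state, action) pair is closest
            # to the pair (x, pi(x)) prescribed by the policy
            k = min(range(len(available)),
                    key=lambda i: distance(available[i][0], available[i][1], x, u))
            _, _, r_k, y_k = available.pop(k)
            total += r_k                   # accumulate the observed reward
            x = y_k                        # jump to the observed next state ("broken" step)
        returns.append(total)
    return sum(returns) / len(returns)     # average cumulated reward over the p trajectories
```

In this sketch, removing each selected transition from the pool makes the p broken trajectories draw on disjoint subsets of the sample; the accuracy of the resulting average then depends on how densely the sample covers the state-action space visited by the policy, consistent with the dependence on sample sparsity in the bias and variance bounds mentioned above.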
