Model-Free Monte Carlo Estimation of Decision Policies

We propose an algorithm for estimating the finite-horizon expected return of a closed-loop control policy from an a priori given (off-policy) sample of one-step transitions. This algorithm, named the Model-free Monte Carlo (MFMC) estimator, averages the cumulative rewards along a set of "broken trajectories" made of one-step transitions selected from the sample on the basis of the control policy. Under Lipschitz continuity assumptions on the system dynamics, the reward function, and the control policy, we provide bounds on the bias and variance of the estimator that depend only on the Lipschitz constants, on the number of broken trajectories used in the estimator, and on the sparsity of the sample of one-step transitions.
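For concreteness, the following Python sketch illustrates how such a broken-trajectory estimator could be put together; it is a minimal illustration under assumed interfaces (the names mfmc_estimate, policy, dist, and the transition tuple layout are our own, not the authors' code).

```python
def mfmc_estimate(transitions, policy, x0, horizon, n_trajectories, dist):
    """Illustrative sketch of a Model-free Monte Carlo (MFMC) style estimator.

    transitions    -- list of one-step transitions (x, u, r, x_next)
    policy         -- deterministic mapping (x, t) -> action u
    x0             -- initial state
    horizon        -- finite horizon T
    n_trajectories -- number of broken trajectories p
    dist           -- distance over state-action pairs: dist((x, u), (x2, u2))
    """
    available = list(range(len(transitions)))  # each sample transition is used at most once
    returns = []
    for _ in range(n_trajectories):
        x, total = x0, 0.0
        for t in range(horizon):
            u = policy(x, t)
            # select the unused transition whose (state, action) is closest to (x, u)
            best = min(available,
                       key=lambda i: dist((transitions[i][0], transitions[i][1]), (x, u)))
            available.remove(best)
            _, _, r, x_next = transitions[best]
            total += r      # accumulate the reward of the selected transition
            x = x_next      # jump to its successor state, forming a "broken" trajectory
        returns.append(total)
    # average the cumulative rewards over the p broken trajectories
    return sum(returns) / len(returns)
```

Removing each selected transition from the pool keeps the p broken trajectories built from distinct sample transitions, which is what the bias and variance bounds mentioned above rely on, together with the sparsity of the sample.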
