Batch mode reinforcement learning based on the synthesis of artificial trajectories

In this paper, we consider the batch mode reinforcement learning setting, in which the central problem is to learn, from a sample of trajectories, a policy that satisfies or optimizes a given performance criterion. We focus on the continuous state space case, for which the usual resolution schemes rely on function approximators, either to represent the underlying control problem or to represent its value function. As an alternative to function approximators, we rely on the synthesis of “artificial trajectories” from the given sample of trajectories, and we show that this idea opens new avenues for designing and analyzing batch mode reinforcement learning algorithms.
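The core construction can be sketched concretely. Below is a minimal Python illustration in the spirit of the model-free Monte Carlo-like policy evaluation the paper builds on: a batch of one-step transitions (x, u, r, x') is rewired into artificial trajectories by matching, at each step, the not-yet-used transition closest to the current state and the policy's action. The Euclidean distance, the greedy nearest-neighbour matching, the function names, and the toy one-dimensional system are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def synthesize_trajectory(pool, policy, x0, horizon):
    """Chain `horizon` one-step transitions (x, u, r, x_next) from `pool`
    into one artificial trajectory; each transition is used at most once.

    Returns the cumulated reward along the rebuilt trajectory.
    """
    x, ret = x0, 0.0
    for t in range(horizon):
        u = policy(t, x)
        # Pick the sample transition nearest to (x, u) in state-action space
        # (plain Euclidean distance here; an illustrative assumption).
        i = min(range(len(pool)),
                key=lambda j: np.linalg.norm(pool[j][0] - x)
                            + np.linalg.norm(pool[j][1] - u))
        _, _, r, x_next = pool.pop(i)   # consume the transition
        ret += r
        x = x_next                      # jump to the logged successor state
    return ret

def artificial_return_estimate(transitions, policy, x0, horizon, p):
    """Average the returns of p artificial trajectories rebuilt from the
    batch, as a Monte-Carlo-like estimate of the policy's return from x0."""
    pool = list(transitions)            # shared pool: trajectories stay disjoint
    return float(np.mean([synthesize_trajectory(pool, policy, x0, horizon)
                          for _ in range(p)]))

# Toy usage on a hypothetical logged 1-D system x' = x + 0.1 u, r = -x'^2.
rng = np.random.default_rng(0)
transitions = []
for _ in range(1000):
    x = rng.uniform(-1.0, 1.0, size=1)
    u = rng.uniform(-1.0, 1.0, size=1)
    x_next = x + 0.1 * u
    transitions.append((x, u, float(-x_next[0] ** 2), x_next))

policy = lambda t, x: -x                # illustrative feedback policy
print(artificial_return_estimate(transitions, policy,
                                 x0=np.array([0.5]), horizon=10, p=5))
```

Note that no model or function approximator is fitted anywhere in this sketch; the accuracy of such an estimate naturally hinges on how densely the sample covers the state-action regions that the evaluated policy actually visits.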
