A Cautious Approach to Generalization in Reinforcement Learning

In the context of a deterministic, Lipschitz-continuous environment with continuous state spaces, finite action spaces, and a finite optimization horizon, we propose an algorithm of polynomial complexity that exploits weak prior knowledge about its environment to compute, from a given sample of trajectories and for a given initial state, a sequence of actions. The proposed Viterbi-like algorithm maximizes a recently proposed lower bound on the return that depends on the initial state, and to this end uses prior knowledge about the environment provided in the form of upper bounds on its Lipschitz constants. It thereby avoids, in a way that depends on the initial state and on the prior knowledge, those regions of the state space where the sample is too sparse to allow safe generalization. Our experiments show that it can lead to more cautious policies than algorithms combining dynamic programming with function approximators. We also give a condition on the sample sparsity ensuring that, for a given initial state, the proposed algorithm produces an optimal open-loop sequence of actions.
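
To make the construction concrete, the sketch below shows one way such a Viterbi-like maximization over a trellis of sampled one-step transitions could be implemented. It is a minimal illustration under one reading of the lower bound (sum of sampled rewards minus Lipschitz penalties on the distances between chained transitions), not the authors' implementation; every function name, argument, and the exact form of the stage-wise Lipschitz constants used here are assumptions made for the example.

```python
# Minimal sketch (not the authors' code) of a Viterbi-like maximization of a
# Lipschitz-based lower bound on the T-stage return, computed from a sample of
# one-step transitions (x_l, u_l, r_l, y_l).  Assumed bound for a chain of
# transitions (l_0, ..., l_{T-1}), with y_{l_{-1}} = x_0:
#   B = sum_t [ r_{l_t} - L_Q[T - t] * || x_{l_t} - y_{l_{t-1}} || ],
# where L_Q[k] = L_rho * (1 + L_f + ... + L_f^{k-1}).
import numpy as np

def cautious_action_sequence(x0, X, U, R, Y, T, L_f, L_rho):
    """Open-loop action sequence maximizing the lower bound from initial state x0.

    X, Y : (n, d) start / end states of the n sampled one-step transitions
    U    : length-n sequence of the corresponding actions
    R    : (n,) rewards;  L_f, L_rho : assumed Lipschitz-constant upper bounds
    """
    n = len(R)
    # L_Q[k] bounds the Lipschitz constant of the k-steps-to-go value function.
    L_Q = [L_rho * sum(L_f ** i for i in range(k)) for k in range(T + 1)]

    # Pairwise "jump" distances between the end state of one transition and the
    # start state of the next (the sparsity penalty appearing in the bound).
    gap = np.linalg.norm(Y[:, None, :] - X[None, :, :], axis=2)   # (n, n)

    # Stage 0: penalize the distance between x0 and each transition's start state.
    score = R - L_Q[T] * np.linalg.norm(X - x0, axis=1)           # (n,)
    back = np.zeros((T, n), dtype=int)

    # Forward (Viterbi-like) pass over the T-stage trellis of transitions: O(n^2 T).
    for t in range(1, T):
        cand = score[:, None] + R[None, :] - L_Q[T - t] * gap     # (n, n)
        back[t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0)

    # Backtrack the best chain of transitions and read off its actions.
    l = int(np.argmax(score))
    chain = [l]
    for t in range(T - 1, 0, -1):
        l = int(back[t, l])
        chain.append(l)
    chain.reverse()
    return [U[l] for l in chain], float(np.max(score))
```

The quadratic-in-sample-size, linear-in-horizon cost of the forward pass is what makes this kind of bound maximization polynomial, and the penalty terms are what steer the returned action sequence away from regions where the sample is sparse.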
