Configurable Markov Decision Processes

In many real-world problems, it is possible to configure, to a limited extent, some environmental parameters to improve the performance of a learning agent. In this paper, we propose a novel framework, Configurable Markov Decision Processes (Conf-MDPs), to model this new type of interaction with the environment. Furthermore, we provide a new learning algorithm, Safe Policy-Model Iteration (SPMI), to jointly and adaptively optimize the policy and the environment configuration. After introducing our approach and deriving some theoretical results, we present an experimental evaluation on two illustrative problems to show the benefits of environment configurability for the performance of the learned policy.
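To make the idea of jointly optimizing a policy and an environment configuration concrete, the sketch below alternates dynamic programming over a finite set of candidate transition models in a tabular setting. All names and the scoring rule are illustrative assumptions; this is not the paper's SPMI algorithm, which additionally bounds each policy/model update to guarantee monotonic improvement.

```python
import numpy as np

def q_value_iteration(P, R, gamma, n_iter=200):
    """Q-value iteration for a fixed transition model P[s, a, s']."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        V = Q.max(axis=1)            # greedy state values
        Q = R + gamma * (P @ V)      # Bellman optimality backup, shape (S, A)
    return Q

def best_configuration(models, R, gamma, mu):
    """Pick the candidate transition model (a stand-in for the configurable
    environment) whose optimal policy maximizes the expected return from the
    initial-state distribution mu, together with that greedy policy."""
    best = None
    for i, P in enumerate(models):
        Q = q_value_iteration(P, R, gamma)
        ret = mu @ Q.max(axis=1)     # expected optimal return under mu
        if best is None or ret > best[0]:
            best = (ret, i, Q.argmax(axis=1))
    return best                      # (return, model index, greedy policy)
```

A full treatment would interleave policy updates and configuration updates with safe, bounded steps rather than exhaustively solving each candidate model, but the exhaustive version above conveys the joint optimization objective.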
