Value Function Discovery in Markov Decision Processes With Evolutionary Algorithms

In this paper, we introduce a novel method for discovering value functions of Markov decision processes (MDPs). This method, which we call value function discovery (VFD), is based on ideas from the field of evolutionary algorithms. VFD's key feature is that it discovers algebraic descriptions of value functions. This feature is unique because the descriptions explicitly include the model parameters of the MDP. The algebraic expression of a value function discovered by VFD can be used in several scenarios, e.g., conversion to a policy (via one-step policy improvement) or control of systems with time-varying parameters. The work in this paper is a first step toward exploring potential usage scenarios of discovered value functions. We give a detailed description of VFD and illustrate its application on an example MDP. For this MDP, we let VFD discover an algebraic description of a value function that closely resembles the optimal value function. The discovered value function is then used to obtain a policy, which we compare numerically to the optimal policy of the MDP. The resulting policy shows near-optimal performance over a wide range of model parameters. Finally, we identify and discuss future application scenarios of discovered value functions.
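
To make the "conversion to a policy" step concrete, the following minimal Python sketch illustrates how one-step policy improvement could be carried out with an algebraic value function. Everything here is an illustrative assumption rather than the paper's actual example: the toy MDP is a uniformized M/M/1 admission-control queue, and the placeholder V(x), the parameters lam, mu, and R, and the transition model are stand-ins for a VFD-discovered description. Note how the model parameters appear explicitly in V(x), which is what makes such a description reusable when parameters change.

```python
# Hedged sketch: one-step policy improvement from an algebraic value
# function, on an assumed toy admission-control queue (not the paper's MDP).

lam, mu = 0.4, 0.6   # assumed arrival/service rates; lam + mu = 1 (uniformized)
R = 50.0             # assumed penalty for rejecting an arrival

def V(x):
    # Placeholder algebraic value function in the model parameters;
    # illustrative only, not the expression discovered in the paper.
    return x * (x + 1) / (2.0 * (mu - lam))

def transitions(x, admit):
    # Next-state distribution under uniformization: an arrival occurs
    # with probability lam (admitted or rejected), a departure with
    # probability mu (self-loop at the empty state).
    up = x + 1 if admit else x
    down = max(x - 1, 0)
    return [(up, lam), (down, mu)]

def cost(x, admit):
    # Holding cost per period, plus the expected rejection penalty
    # when arrivals are being turned away.
    return x + (0.0 if admit else lam * R)

def one_step_improved_action(x):
    # One-step policy improvement: pick the action minimizing the
    # immediate cost plus the expected value of the next state.
    def q(admit):
        return cost(x, admit) + sum(p * V(y) for y, p in transitions(x, admit))
    return min((True, False), key=q)

if __name__ == "__main__":
    policy = {x: one_step_improved_action(x) for x in range(12)}
    print(policy)  # with these assumed parameters, a threshold policy emerges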
