Deriving Explicit Control Policies for Markov Decision Processes Using Symbolic Regression

In this paper, we introduce a novel approach to optimizing the control of systems that can be modeled as Markov decision processes (MDPs) with a threshold-based optimal policy. Our method builds on a genetic-programming technique known as symbolic regression (SR). We show how the performance of this technique can be greatly improved by exploiting the structure of the underlying MDP. The proposed method has two main advantages: (1) it produces near-optimal decision policies, and (2) in contrast to other algorithms, it yields closed-form approximations. Having an explicit expression for the decision policy makes sensitivity analysis possible and allows the threshold function to be recomputed instantly for any change in the model parameters. We emphasize that the technique is general and applies to any MDP with a threshold-based optimal policy. Extensive experimentation demonstrates the usefulness of the method.
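To make the idea concrete, the sketch below illustrates the kind of symbolic-regression step the abstract describes, in a deliberately simplified form. The MDP, its loads, and the "numerically computed" thresholds here are hypothetical stand-ins (generated from an assumed toy rule, not from the paper's model), and the expression grammar is reduced to three candidate closed forms with one free constant each; a real SR run would evolve expression trees over a much richer grammar.

```python
import math

# Hypothetical stand-in for the paper's setting: for each system load rho
# we pretend value iteration produced the optimal threshold below.
loads = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
thresholds = [math.floor(2.0 / (1.0 - r)) for r in loads]

# A toy symbolic-regression step: search a tiny grammar of closed forms
# (each with one free constant a) for the best fit to the threshold data.
grammar = {
    "a / (1 - rho)": lambda rho, a: a / (1.0 - rho),
    "a * rho":       lambda rho, a: a * rho,
    "a + rho":       lambda rho, a: a + rho,
}

def sse(form, a):
    """Squared error of the rounded-down closed form against the data."""
    return sum((math.floor(form(r, a)) - t) ** 2
               for r, t in zip(loads, thresholds))

# Grid-search the constant a over {0.5, 1.0, ..., 4.0} for every form.
best = min(
    ((name, a) for name in grammar for a in [x / 2 for x in range(1, 9)]),
    key=lambda cand: sse(grammar[cand[0]], cand[1]),
)
print(best)  # recovered closed-form threshold function and its constant
```

Because the recovered policy is an explicit expression in the load, a change in the parameters only requires re-evaluating the formula, which is the sensitivity-analysis advantage the abstract highlights.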
