(Co)Algebraic Techniques for Markov Decision Processes

Markov Decision Processes (MDPs) [11] are a family of probabilistic, state-based models used in planning under uncertainty and reinforcement learning. Informally, an MDP models a situation in which an agent (the decision maker) makes choices at each state of a process, and each choice leads to some reward and a probabilistic transition to a next state. The aim of the agent is to find an optimal policy, i.e., a way of choosing actions that maximizes future expected rewards. The classic theory of MDPs with discounting is well-developed (see [11, Chapter 6]), and indeed we do not prove any new results about MDPs as such. Our work is inspired by Bellman’s principle of optimality, which states the following: “An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision” [2, Chapter III.3]. This principle has clear coinductive overtones, and our aim is to situate it in a body of mathematics that is also concerned with infinite behavior and coinductive proof principles, i.e., in coalgebra. Probabilistic systems of a similar type have been studied extensively, also coalgebraically, in the area of program semantics (see for instance [5, 6, 14, 15]). Our focus is not so much on the observable behavior of MDPs viewed as computations as on their role in solving optimal planning problems. This abstract is based on [7], to which we refer for a more detailed account.
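For concreteness, one standard way to express the principle of optimality for discounted MDPs is the Bellman optimality equation; the notation below (optimal value function $V^*$, action set $A(s)$, reward function $R$, transition probabilities $P$, discount factor $\gamma \in [0,1)$) is not fixed in this abstract and is used only as an illustrative sketch:
\[
V^*(s) \;=\; \max_{a \in A(s)} \Big( R(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s,a)\, V^*(s') \Big).
\]
The optimal value thus reproduces itself under a one-step unfolding of the process; it is this fixed-point form of the principle that gives it its coinductive flavor.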