In this document I explain the Bellman equation, which provides a method for minimizing a cost function and arriving at a control policy.

1. Example of a game

Suppose that our states $x$ refer to positions on a grid, as shown below. If we are at the goal state, the state cost per time step is zero. If we are at any other state, the state cost per time step is 5. Let us use the term $J_x$ to refer to this state cost per time step:

The goal state is at row 2, column 2, which means that if we are at this state, we incur no state cost. The double lines mark a 'wall', which prevents a move from one state to the neighboring state. That is, there is a wall between the top left and top middle states.

If we perform some action $u$ (say, move from one box to the neighboring box), there will be a motor cost per time step, which we refer to with the symbol $J_u$. The motor cost is one if we move ($J_u = 1$), and zero otherwise. So the total cost per time step is:

$$\alpha^{(n)} = J_x^{(n)} + J_u^{(n)} \qquad (2)$$

The term $\pi^{(n)}(x)$ refers to the policy that we have. This policy specifies the action $u^{(n)}(x)$ that we will perform in each state $x$ at time point $n$. For example, if we pick a random policy, then we might have actions that look like this:

Suppose our final time step is $p$. If we are now at time point $k$, our objective is to find the policy that minimizes the total cost to go, $\sum_{i=k}^{p} \alpha^{(i)}$. Let us define the goodness of each policy via a value function: