Approximating Action-Value Functions: Addressing Issues of Dynamic Range

Abstract: Function approximation is necessary when applying reinforcement learning (RL) to Markov decision processes (MDPs) or semi-Markov decision processes (SMDPs) with very large state spaces. An often overlooked issue in approximating Q-functions in either framework arises when an action-value update made in one state causes a large change in the greedy policy at other states. Put another way, a small change in the Q-function can produce a large change in the implied greedy policy. We call this sensitivity to changes in the Q-function the dynamic range problem and suggest that it may greatly increase the number of training updates required to accurately approximate the optimal policy. We demonstrate that Advantage Learning solves the dynamic range problem in both frameworks and is more robust than several other RL algorithms on such problems. For an MDP, Advantage Learning addresses the issue by rescaling the dynamic range of action values within each state by a constant; for an SMDP, the scaling constant can vary for each action.
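
As an illustrative sketch only (this is the formulation commonly attributed to Baird's Advantage Learning, not a derivation from this abstract; the scaling constant k and transition time \Delta t are introduced here purely for illustration), the optimal advantage function can be written as

    A^*(s,a) \;=\; V^*(s) \;+\; \frac{1}{k\,\Delta t}\Big(\mathbb{E}\big[r + \gamma^{\Delta t}\,V^*(s')\big] - V^*(s)\Big),
    \qquad V^*(s) \;=\; \max_{a'} A^*(s,a').

The parenthesized term is zero for an optimal action and negative otherwise, so dividing by k\,\Delta t (typically less than 1) widens the gaps between action values within each state. In an MDP the transition time \Delta t is fixed, giving a single rescaling constant; in an SMDP \Delta t depends on the action taken, so the rescaling constant can differ across actions.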