Improving Policies without Measuring Merits

Performing policy iteration in dynamic programming should only require knowledge of relative rather than absolute measures of the utility of actions (Werbos, 1991), which Baird (1993) calls the advantages of actions at states. Nevertheless, most existing methods in dynamic programming (including Baird's) compute some form of absolute utility function. For smooth problems, advantages satisfy two differential consistency conditions (including the requirement that they be free of curl), and we show that enforcing these conditions can lead to appropriate policy improvement solely in terms of advantages.
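
As a small illustration of why relative measures suffice for policy improvement, the following sketch compares a greedy policy computed from absolute action values with one computed from advantages. It assumes a common convention in which the advantage of an action is its action value minus the best action value at the same state; this convention, the table `q`, and the helper names `advantages` and `greedy_policy` are hypothetical and are not taken from the paper.

```python
def advantages(q):
    """Relative utility per state: each action value minus the best one there.

    Assumed convention (not necessarily the paper's definition):
    A(s, a) = Q(s, a) - max_b Q(s, b).
    """
    return [[qa - max(row) for qa in row] for row in q]


def greedy_policy(values):
    """Index of the highest-valued action in every state (first on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in values]


# Hypothetical table of absolute action values: 3 states, 2 actions.
q = [[1.0, 2.0],
     [5.0, 4.5],
     [-3.0, -1.0]]

adv = advantages(q)
policy_from_q = greedy_policy(q)
policy_from_adv = greedy_policy(adv)

# Adding an arbitrary per-state offset changes the absolute values
# but leaves the advantages, and hence the greedy policy, unchanged.
offsets = [10.0, -7.0, 3.0]
q_shifted = [[qa + c for qa in row] for row, c in zip(q, offsets)]
adv_shifted = advantages(q_shifted)
```

Under this convention the best action in each state has advantage zero and every other action has a negative advantage, so the greedy choice can be read off without knowing how good any state is in absolute terms.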