Reinforcement Learning by Stochastic Hill Climbing on Discounted Reward

Abstract Reinforcement learning systems are often required to find stochastic policies rather than deterministic ones, and to gain as much reward as possible while still learning. Q-learning was not designed to find stochastic policies, and it does not guarantee rational behavior partway through learning. This paper presents a new reinforcement learning approach, based on a simple credit-assignment scheme, for finding memoryless policies. It satisfies both requirements by treating the policy and the exploration strategy as one and the same. Mathematical analysis shows that the proposed method performs stochastic gradient ascent on the discounted reward in Markov decision processes (MDPs), and is related to the average-reward framework. The analysis also assures that the method can be extended to continuous environments. We investigate its behavior in comparison with Q-learning on a small MDP and on a non-Markovian example.
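The abstract's central idea, that treating a stochastic policy itself as the exploration strategy yields stochastic gradient ascent on the discounted reward, can be illustrated with a short sketch. The Python code below is a minimal, hypothetical rendition of that idea: a tabular softmax policy whose parameters are hill-climbed along a discounted eligibility trace, in the style of REINFORCE-like policy-gradient methods. The environment interface (`reset`/`step`), the baseline, and all hyperparameters are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def softmax_policy(theta, s):
    """Action probabilities for state s under parameters theta[s]."""
    z = theta[s] - theta[s].max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def run_episode(env, theta, alpha=0.01, gamma=0.95, baseline=0.0):
    """One episode of stochastic hill climbing on discounted reward.

    `env` is assumed to expose reset() -> state and
    step(action) -> (state, reward, done); this interface is an
    assumption for illustration.
    """
    trace = np.zeros_like(theta)           # discounted eligibility trace
    s = env.reset()
    done = False
    while not done:
        p = softmax_policy(theta, s)
        a = np.random.choice(len(p), p=p)  # sample from the stochastic policy
        # d/dtheta log pi(a|s) for a softmax policy: one-hot(a) - p
        grad_log = -p
        grad_log[a] += 1.0
        trace *= gamma                     # discount credit for older choices
        trace[s] += grad_log               # assign credit to the current choice
        s, r, done = env.step(a)
        # hill-climb: move parameters along the reward-weighted trace
        theta += alpha * (r - baseline) * trace
    return theta
```

Because the policy is itself stochastic, exploration is built in: no separate epsilon-greedy schedule is needed, which reflects the abstract's point about considering the policy and the exploration strategy identically.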