Addressing the Policy-Bias of Q-Learning by Repeating Updates

Q-learning is a popular reinforcement learning algorithm that has been proven to converge to optimal policies in Markov decision processes. However, Q-learning exhibits artifacts when the optimal action is played with low probability, a situation that may arise from a poor initialization of Q-values or from convergence to an almost pure policy followed by a change in the environment that makes another action optimal. These artifacts were addressed in the literature by the variant Frequency Adjusted Q-learning (FAQL). However, FAQL suffers from practical concerns that limit the policy subspace over which its behavior is improved. Here, we introduce Repeated Update Q-learning (RUQL), a variant of Q-learning that resolves the undesirable artifacts of Q-learning without the practical concerns of FAQL. We show, both theoretically and experimentally, the similarities and differences between RUQL and FAQL (the closest state-of-the-art algorithm). Experimental results verify the theoretical insights and show that RUQL outperforms FAQL and Q-learning in non-stationary environments.
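For concreteness, the repeated-update idea can be sketched as follows; the notation and the closed-form expression below are an illustrative reading of "repeating updates", not quoted from the paper. The standard Q-learning update for a transition (s, a, r, s') with learning rate \(\alpha\) and discount \(\gamma\) is
\[
Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,\bigl(r + \gamma \max_{a'} Q(s',a')\bigr).
\]
If this update is repeated \(1/\pi(s,a)\) times, where \(\pi(s,a)\) denotes the probability of selecting action \(a\) in state \(s\), the repetitions collapse into the single closed-form update
\[
Q(s,a) \leftarrow (1-\alpha)^{1/\pi(s,a)}\,Q(s,a) + \bigl(1-(1-\alpha)^{1/\pi(s,a)}\bigr)\bigl(r + \gamma \max_{a'} Q(s',a')\bigr),
\]
so that rarely chosen actions receive correspondingly stronger updates, counteracting the policy bias of vanilla Q-learning.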