No-regret Reinforcement Learning

This article surveys a stream of work in reinforcement learning (RL) focused on the online objective of regret minimization. The traditional objective in RL has been to identify a good behavioural policy using as few interactions with an unknown environment as possible, without explicitly accounting for the reward earned along the way. Many sequential decision-making settings, however, require algorithms that ensure high cumulative reward, or equivalently low regret, over the entire learning horizon. We discuss regret-minimizing learning algorithms in a variety of RL settings: the basic multi-armed bandit or stateless Markov Decision Process (MDP), structured (parametric and nonparametric) bandits, online learning in the tabula rasa MDP setting, and structured (parametric and nonparametric) MDPs.
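
To fix ideas, in the basic multi-armed bandit setting the regret of a learner over a horizon of T rounds is commonly defined as the gap between the cumulative mean reward of the best fixed arm and that of the arms actually played. The notation below follows a standard convention rather than any single paper surveyed here: with K arms of mean rewards \mu_1, \dots, \mu_K and A_t denoting the arm played at round t,

\[
  R_T \;=\; T \max_{a \in \{1,\dots,K\}} \mu_a \;-\; \mathbb{E}\!\left[ \sum_{t=1}^{T} \mu_{A_t} \right].
\]

An algorithm is called no-regret if R_T grows sublinearly in T, i.e. R_T = o(T), so that its average per-round reward converges to that of the best fixed arm.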