Reinforcement Learning of Contextual MDPs using Spectral Methods

We propose a new reinforcement learning (RL) algorithm for contextual Markov decision processes (CMDPs) using spectral methods. CMDPs are structured MDPs in which the dynamics and rewards depend on a small number of hidden states, or contexts. If the mapping between the hidden and observed states is known a priori, then standard RL algorithms such as UCRL are guaranteed to attain low regret. Is it possible to achieve regret of the same order even when the mapping is unknown? We provide an affirmative answer in this paper. We exploit spectral methods to learn the mapping from hidden to observed states with guaranteed confidence bounds, and we incorporate this estimate into a UCRL-based framework to obtain order-optimal regret.
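To make the high-level idea concrete, the following is a minimal illustrative sketch (not the paper's actual estimator, which relies on method-of-moments guarantees with explicit confidence intervals): it recovers a hidden-to-observed state mapping from trajectory data via a simple SVD-based spectral step, after which observed trajectories can be collapsed to context-level trajectories and passed to a standard optimistic planner such as UCRL. The moment-matrix construction and the clustering heuristic below are illustrative assumptions.

```python
# Illustrative sketch only: estimate which hidden context each observed state
# belongs to, using the low-rank structure of consecutive-observation moments.
# This is an assumption-laden simplification, not the algorithm from the paper.
import numpy as np

def estimate_context_map(observations, n_obs_states, n_contexts):
    """Cluster observed states into hidden contexts using the top singular
    vectors of the empirical co-occurrence matrix of consecutive observations."""
    # Empirical second-moment matrix M[i, j] ~ P(x_t = i, x_{t+1} = j).
    M = np.zeros((n_obs_states, n_obs_states))
    for x, x_next in zip(observations[:-1], observations[1:]):
        M[x, x_next] += 1.0
    M /= max(len(observations) - 1, 1)

    # Under the CMDP structure, M has rank at most n_contexts, so the top
    # left singular vectors give low-dimensional features that separate contexts.
    U, _, _ = np.linalg.svd(M)
    features = U[:, :n_contexts]

    # Simple k-means-style clustering heuristic (assumption) on those features.
    rng = np.random.default_rng(0)
    centroids = features[rng.choice(n_obs_states, n_contexts, replace=False)]
    for _ in range(50):
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for k in range(n_contexts):
            if np.any(labels == k):
                centroids[k] = features[labels == k].mean(axis=0)
    return labels  # labels[x] = estimated hidden context of observed state x
```

Once such a mapping is estimated, counts of transitions and rewards can be aggregated at the context level and fed to an optimistic RL algorithm; the paper's contribution is doing this with spectral estimators whose confidence bounds compose with the UCRL analysis to yield order-optimal regret.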