Reinforcement Learning of Contextual MDPs using Spectral Methods

We propose a new reinforcement learning (RL) algorithm for contextual Markov decision processes (CMDPs) using spectral methods. CMDPs are structured MDPs in which the dynamics and rewards depend on a small number of hidden states, or contexts. If the mapping between the hidden and observed states is known a priori, then standard RL algorithms such as UCRL are guaranteed to attain low regret. Is it possible to achieve regret of the same order even when the mapping is unknown? We provide an affirmative answer in this paper. We exploit spectral methods to learn the mapping from hidden to observed states with guaranteed confidence bounds, and we incorporate this estimate into a UCRL-based framework to obtain order-optimal regret.
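To make the high-level idea concrete, the following is a minimal illustrative sketch (not the paper's actual estimator, which relies on method-of-moments guarantees with explicit confidence intervals): it recovers a hidden-to-observed state mapping from trajectory data via a simple SVD-based spectral step, after which observed trajectories can be collapsed to context-level trajectories and passed to a standard optimistic planner such as UCRL. The moment-matrix construction and the clustering heuristic below are illustrative assumptions.

```python
# Illustrative sketch only: estimate which hidden context each observed state
# belongs to, using the low-rank structure of consecutive-observation moments.
# This is an assumption-laden simplification, not the algorithm from the paper.
import numpy as np

def estimate_context_map(observations, n_obs_states, n_contexts):
    """Cluster observed states into hidden contexts using the top singular
    vectors of the empirical co-occurrence matrix of consecutive observations."""
    # Empirical second-moment matrix M[i, j] ~ P(x_t = i, x_{t+1} = j).
    M = np.zeros((n_obs_states, n_obs_states))
    for x, x_next in zip(observations[:-1], observations[1:]):
        M[x, x_next] += 1.0
    M /= max(len(observations) - 1, 1)

    # Under the CMDP structure, M has rank at most n_contexts, so the top
    # left singular vectors give low-dimensional features that separate contexts.
    U, _, _ = np.linalg.svd(M)
    features = U[:, :n_contexts]

    # Simple k-means-style clustering heuristic (assumption) on those features.
    rng = np.random.default_rng(0)
    centroids = features[rng.choice(n_obs_states, n_contexts, replace=False)]
    for _ in range(50):
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for k in range(n_contexts):
            if np.any(labels == k):
                centroids[k] = features[labels == k].mean(axis=0)
    return labels  # labels[x] = estimated hidden context of observed state x
```

Once such a mapping is estimated, counts of transitions and rewards can be aggregated at the context level and fed to an optimistic RL algorithm; the paper's contribution is doing this with spectral estimators whose confidence bounds compose with the UCRL analysis to yield order-optimal regret.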