Using bisimulation for policy transfer in MDPs (Extended Abstract)

1. MAIN RESULTS

Much of the work on using Markov Decision Processes (MDPs) in artificial intelligence (AI) focuses on solving a single problem. However, AI agents often exist over a long period of time, during which they may be required to solve several related tasks. This type of scenario has motivated a significant amount of recent research on knowledge transfer methods for MDPs. The idea is to allow an agent to re-use, over its lifetime, the expertise accumulated while solving past tasks (see Taylor & Stone, 2009, for a comprehensive survey).

We focus on transferring knowledge in MDPs that are fully specified by their state set S, action set A, reward function R : S×A → ℝ, and state transition probabilities P : S×A → Dist(S) (where Dist(S) is the set of distributions over the set S). A policy π is a function from states to actions, π : S → A. The value of a state s ∈ S under policy π is defined as V^π(s) = E_π{ ∑_{t=0}^∞ γ^t r_{t+1} | s_0 = s }, where r_t is the reward received at time step t, and γ ∈ (0,1) is a discount factor. Solving an MDP means finding the optimal value function V*(s) = max_π V^π(s) and the associated optimal policy π*. The action-value function Q* : S×A → ℝ gives the expected return for each state-action pair, assuming the optimal policy is followed thereafter.

Let M1 = ⟨S1, A1, P1, R1⟩ and M2 = ⟨S2, A2, P2, R2⟩ be two MDPs, and let V*_1 (Q*_1) and V*_2 (Q*_2) denote their respective optimal value functions. Our goal is to provide methods for transferring a policy from one MDP to the other, while ensuring strong theoretical guarantees on the expected return of the transferred policy in the new MDP. Our methods are based on bisimulation metrics, introduced by Ferns, Panangaden & Precup (2004). Bisimulation is a notion of behavioral equivalence between states of an MDP.
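To make the quantities above concrete, the following Python sketch (assuming finite state and action sets, NumPy arrays R of shape (S, A) and P of shape (S, A, S), and SciPy's linear-programming solver) computes the Ferns, Panangaden & Precup (2004) bisimulation metric by fixed-point iteration, using the Kantorovich (1-Wasserstein) distance between transition distributions; the weights (1−γ) and γ are one common instantiation of the metric's reward and transition coefficients. The transfer_policy helper at the end is only an illustrative rule, matching each state of M2 to its closest M1 state under a metric computed on the disjoint union of the two state spaces (which presumes a shared action set); the function names and this transfer rule are assumptions for illustration, not necessarily the exact method of the paper.

```python
import numpy as np
from scipy.optimize import linprog


def kantorovich(p, q, d):
    """Kantorovich (1-Wasserstein) distance between discrete distributions
    p and q over the same state set, with ground metric d, solved as an LP."""
    n = len(p)
    c = d.reshape(-1)                      # cost of moving mass from i to j
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # mass leaving state i equals p[i]
        A_eq[n + i, i::n] = 1.0            # mass arriving at state i equals q[i]
    b_eq = np.concatenate([p, q])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun


def bisimulation_metric(R, P, gamma, n_iter=30):
    """Fixed-point iteration for the bisimulation metric of Ferns et al. (2004):
        d(s, t) = max_a [ (1 - gamma) * |R[s, a] - R[t, a]|
                          + gamma * K_d(P[s, a], P[t, a]) ],
    where K_d is the Kantorovich distance under the current metric d.
    R has shape (S, A); P has shape (S, A, S)."""
    S, A = R.shape
    d = np.zeros((S, S))
    for _ in range(n_iter):
        d_new = np.zeros((S, S))
        for s in range(S):
            for t in range(s + 1, S):
                gap = max((1 - gamma) * abs(R[s, a] - R[t, a])
                          + gamma * kantorovich(P[s, a], P[t, a], d)
                          for a in range(A))
                d_new[s, t] = d_new[t, s] = gap
        d = d_new
    return d


def transfer_policy(d_union, pi1, n1, n2):
    """Illustrative transfer rule: given a metric d_union computed on the
    disjoint union of S1 and S2 (M1 states listed first), map each M2 state
    to the M1 state closest to it and copy that state's action under pi1."""
    cross = d_union[:n1, n1:]              # distances from M1 states to M2 states
    return [pi1[int(np.argmin(cross[:, t]))] for t in range(n2)]
```

Because the operator above is a γ-contraction, iterating it from the zero matrix converges to the metric; in practice one would stop when successive iterates differ by less than a tolerance rather than running a fixed number of sweeps.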