Improved Switching among Temporally Abstract Actions

In robotics and other control applications it is commonplace to have a preexisting set of controllers for solving subtasks, perhaps hand-crafted or previously learned or planned, and still face a difficult problem of how to choose and switch among the controllers to solve an overall task as well as possible. In this paper we present a framework based on Markov decision processes and semi-Markov decision processes for phrasing this problem, a basic theorem regarding the improvement in performance that can be obtained by switching flexibly between given controllers, and example applications of the theorem. In particular, we show how an agent can plan with these high-level controllers and then use the results of such planning to find an even better plan, by modifying the existing controllers, with negligible additional cost and no re-planning. In one of our examples, the complexity of the problem is reduced from 24 billion state-action pairs to less than a million state-controller pairs.

In many applications, solutions to parts of a task are known, either because they were hand-crafted by people or because they were previously learned or planned. For example, in robotics applications, there may exist controllers for moving joints to positions, picking up objects, controlling eye movements, or navigating along hallways. More generally, an intelligent system may have available to it several temporally extended courses of action to choose from. In such cases, a key challenge is to take full advantage of the existing temporally extended actions, to choose or switch among them effectively, and to plan at their level rather than at the level of individual actions. Recently, several researchers have begun to address these challenges within the framework of reinforcement learning and Markov decision processes (e.g., Singh, 1992; Kaelbling, 1993; Dayan & Hinton, 1993; Thrun & Schwartz, 1995; Sutton, 1995; Dietterich, 1998; Parr & Russell, 1998; McGovern, Sutton & Fagg, 1997). Common to much of this recent work is the modeling of a temporally extended action as a policy (controller) and a condition for terminating, which we together refer to as an option (Sutton, Precup & Singh, 1998). In this paper we consider the problem of effectively combining given options into one overall policy, generalizing prior work by Kaelbling (1993). Sections 1-3 introduce the framework; our new results are in Sections 4 and 5.

1 Reinforcement Learning (MDP) Framework

In a Markov decision process (MDP), an agent interacts with an environment at some discrete, lowest-level time scale $t = 0, 1, 2, \ldots$ On each time step, the agent perceives the state of the environment, $s_t \in \mathcal{S}$, and on that basis chooses a primitive action, $a_t \in \mathcal{A}_{s_t}$. In response to each action, $a_t$, the environment produces one step later a numerical reward, $r_{t+1}$, and a next state, $s_{t+1}$. The one-step model of the environment consists of the one-step state-transition probabilities and the one-step expected rewards,
$$p^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\} \quad \text{and} \quad r^a_s = E\{r_{t+1} \mid s_t = s, a_t = a\},$$
for all $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}_s$.
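As a concrete illustration, the following Python sketch (not part of the original development) shows one way the one-step model of a small tabular MDP might be represented: nested dictionaries holding the transition probabilities $p^a_{ss'}$ and the expected rewards $r^a_s$. The two-state MDP, its state and action names, and the reward values are hypothetical, chosen only to make the data structures concrete.

from typing import Dict

State = str
Action = str

# One-step state-transition probabilities p^a_{ss'}:
# p[s][a][s2] = Pr{s_{t+1} = s2 | s_t = s, a_t = a}.
p: Dict[State, Dict[Action, Dict[State, float]]] = {
    "s0": {"left": {"s0": 1.0},
           "right": {"s1": 0.9, "s0": 0.1}},
    "s1": {"left": {"s0": 0.8, "s1": 0.2},
           "right": {"s1": 1.0}},
}

# One-step expected rewards r^a_s:
# r[s][a] = E{r_{t+1} | s_t = s, a_t = a}.
r: Dict[State, Dict[Action, float]] = {
    "s0": {"left": 0.0, "right": 1.0},
    "s1": {"left": 0.0, "right": 2.0},
}

# Sanity check: each p[s][a] must be a probability distribution over next states.
for s, actions in p.items():
    for a, dist in actions.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)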
The agent's objective is to learn an optimal Markov policy, a mapping from states to probabilities of taking each available primitive action, $\pi : \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$, that maximizes the expected discounted future reward from each state $s$:
$$V^\pi(s) = E\{r_{t+1} + \gamma r_{t+2} + \cdots \mid s_t = s, \pi\} = \sum_{a \in \mathcal{A}_s} \pi(s, a) \Big[ r^a_s + \gamma \sum_{s'} p^a_{ss'} V^\pi(s') \Big],$$
where $\pi(s, a)$ is the probability with which the policy $\pi$ chooses action $a \in \mathcal{A}_s$ in state $s$, and $\gamma \in [0, 1]$ is a discount-rate parameter. $V^\pi(s)$ is called the value of state $s$ under policy $\pi$, and $V^\pi$ is called the state-value function for $\pi$. The optimal state-value function gives the value of a state under an optimal policy: $V^*(s) = \max_\pi V^\pi(s) = \max_{a \in \mathcal{A}_s} \big[ r^a_s + \gamma \sum_{s'} p^a_{ss'} V^*(s') \big]$. Given $V^*$, an optimal policy is easily formed by choosing in each state $s$ any action that achieves the maximum in this equation. A parallel set of value functions, denoted $Q^\pi$ and $Q^*$, and Bellman equations can be defined for state-action pairs rather than for states. Planning in reinforcement learning refers to the use of models of the environment to compute value functions and thereby to optimize or improve policies.
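To make the planning step concrete, here is a minimal sketch of tabular value iteration, again illustrative rather than part of the paper: it repeatedly applies the Bellman optimality backup until the largest change in $V$ falls below a tolerance, then reads off a greedy policy. It assumes the hypothetical p and r dictionaries from the earlier sketch; the function name and default parameters are likewise illustrative.

def value_iteration(p, r, gamma=0.9, theta=1e-8):
    """Compute V* and a greedy policy for a tabular MDP from its one-step
    model (p, r), assuming every reachable next state is a key of p."""
    V = {s: 0.0 for s in p}
    while True:
        delta = 0.0
        for s in p:
            # Bellman optimality backup:
            # V(s) <- max_a [ r^a_s + gamma * sum_{s'} p^a_{ss'} V(s') ]
            best = max(
                r[s][a] + gamma * sum(prob * V[s2] for s2, prob in p[s][a].items())
                for a in p[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # In each state, any action achieving the maximum in the Bellman
    # optimality equation defines an optimal (greedy) policy.
    policy = {}
    for s in p:
        policy[s] = max(
            p[s],
            key=lambda a: r[s][a]
            + gamma * sum(prob * V[s2] for s2, prob in p[s][a].items()),
        )
    return V, policy

For the toy model above, V, pi_star = value_iteration(p, r, gamma=0.9) returns the optimal state values and a deterministic greedy policy; this is planning in the sense used here, computing value functions from a model of the environment in order to improve the policy.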