Abstract

Traditional Reinforcement Learning (RL) has focused on problems involving many states and few actions, such as simple grid worlds. Most real-world problems, however, are of the opposite type, involving few relevant states and many actions. For example, to return home from a conference, humans identify only a few subgoal states, such as lobby, taxi, airport, etc. Each valid behavior connecting two such states can be viewed as an action, and there are trillions of them. Assuming the subgoal identification problem is already solved, the quality of any RL method in real-world settings depends less on how well it scales with the number of states than on how well it scales with the number of actions. This is where our new method, T-learning, excels: it evaluates the relatively few possible transits from one state to another in a policy-independent way, rather than a huge number of state-action pairs, or states in traditional policy-dependent ways. Illustrative experiments demonstrate that the performance improvement of T-learning over Q-learning can be arbitrarily large.

1 Motivation and overview

Traditional Reinforcement Learning (RL) has focused on problems involving many states and few actions, such as simple grid worlds. Most real-world problems, however, are of the opposite type, involving few relevant states and many actions. For example, to return home from a conference, humans identify only a few subgoal states, such as lobby, taxi, airport, etc. Each valid behavior connecting two such states can be viewed as an action, and there are trillions of them. Assuming the subgoal identification problem is already solved by a method outside the scope of this paper, the quality of any RL method in real-world settings depends less on how well it scales with the number of states than on how well it scales with the number of actions.

Likewise, when we humans reach an unfamiliar state, we generally resist testing every possible action before determining the good states to transition to. We can, for example, observe the state transitions that other humans pass through while accomplishing the same task, or reach some rewarding state by happenstance. Then we can focus on reproducing that sequence of states. That is, we are able to first identify a task before acquiring the skills to reliably perform it.

Take, for example, the task of walking along a balance beam. In order to traverse the length of the beam without falling, a precise action must be chosen at every step from a very large set of possibilities. The probability of failure is high because almost all actions at every step lead to imbalance and falling, so a good deal of training is required to learn the precise movements that reliably take one across. Throughout the procedure, however, the desired trajectory of states is well understood; the more difficult part is achieving it reliably.

Reinforcement-learning methods that learn action values, such as Q-learning, Sarsa, and TD(0), are guaranteed to converge to the optimal value function provided all state-action pairs in the underlying MDP are visited infinitely often. These methods can therefore converge extremely slowly in environments with large action spaces.
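To make the scaling issue concrete, here is a minimal sketch of standard tabular Q-learning. The environment interface (reset() returning a state, step(action) returning the next state, reward, and a termination flag), the function name q_learning, and the hyperparameters are illustrative assumptions, not part of this report; the point is only that the value table has one entry per state-action pair.

import numpy as np

# Minimal tabular Q-learning (illustrative sketch only). `env` is an assumed
# interface: reset() -> state, step(action) -> (next_state, reward, done).
def q_learning(env, n_states, n_actions, episodes=1000,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))   # one entry per state-action pair
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration over ALL actions available in s
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # standard one-step Q-learning backup
            target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

With thousands of actions per state, the convergence guarantee quoted above requires every one of the |S| x |A| entries of Q to be visited sufficiently often, which is precisely the burden the method introduced next is designed to avoid.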
This paper introduces an elegant new algorithm that automatically focuses search in action space by learning state-transition values independent of action. We call the method T-learning; it represents a novel off-policy approach to reinforcement learning.

T-learning is a temporal-difference (TD) method [6], and as such it has much in common with other TD methods, especially action-value methods such as Sarsa and Q-learning [8, 9]. But it is quite different. Instead of learning the values of state-action pairs as action-value methods do, it learns the values of state-state pairs (here referred to as transitions). The value of the transitions between states is recorded explicitly, rather than the value of the states themselves or the value of state-action pairs. The learning task is decomposed into two separate and independent components: (1) learning the transition values, and (2) learning the optimal actions. The transition-value function allows high-payoff transitions to be easily identified, allowing for a focused search in action space to discover the actions that achieve those valuable transitions reliably.

Agents that learn the values of state transitions can exhibit markedly different behavior from those that learn state-action pairs. Action-value methods are particularly suited to tasks with small action spaces, where learning about all state-action pairs is not much more cumbersome than learning about the states alone. However, as the size of the action space increases, such methods become less feasible. Furthermore, action-value methods have no explicit mechanism for identifying valuable state transitions and focusing learning there. They lack an important real-world bias: that valuable state transitions can often be achieved with high reliability. As a result, in these common situations, action-value methods require extensive and undue search before converging to an optimal policy. T-learning, on the other hand, has an initial bias: it presumes the existence of reliable actions that will achieve any valuable transition yet observed. This bias enables the valuable transitions to be easily identified and search to be focused there. As a result, the difficulties induced by large action spaces are significantly reduced.

2 Environments requiring precision

Consider the transition graph of an MDP, where the vertices of the graph are the states of the environment and the edges represent transitions between states. Define a function τ : S → S that maps s to the neighboring vertex s′ whose value under the optimal policy, V∗(s′), is the highest among all neighbors of s, where V∗ is computed for a given γ as though the agent had actions available in every state that move it deterministically along the graph of the environment. The class of MDPs for which T-learning is particularly suited can be described formally as follows: if τ(s) = s′, then

1. E[Pr(s′ | s, a ∈ A)] > ε, and
2. Pr(s′ | s, a∗) > 1 − ε, for some a∗ ∈ A,

where ε is a small positive value. These environments are those where specific skills can accomplish tasks reliably. Walking across a balance beam, for example, requires specific skills. The first constraint ensures that the rewarding transitions are likely to be observed. The second constraint ensures that the transitions associated with large reward signals can be achieved by finding a specific skill, i.e., a reliable action. Without this guarantee, one might never attempt to acquire certain skills, because the average outcome during learning may be undesirable.
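To make the two-component decomposition above concrete, the following Python sketch keeps one table of transition values T(s, s′) and, separately, per-action success statistics for each desired transition. The class name TransitionLearner, its methods, the TD-style backup, and the success-rate bookkeeping are stand-ins invented for this illustration; they are not the report's actual T-learning update rule.

from collections import defaultdict

# Illustrative sketch of the decomposition into (1) transition values and
# (2) a search for reliable actions. The update rules below are stand-in
# assumptions for this sketch, not the report's exact T-learning algorithm.
class TransitionLearner:
    def __init__(self, alpha=0.1, gamma=0.9):
        self.alpha, self.gamma = alpha, gamma
        self.T = defaultdict(float)      # (s, s') -> transition value
        self.tries = defaultdict(int)    # (s, s', a) -> number of attempts
        self.hits = defaultdict(int)     # (s, s', a) -> number of successes

    def update_transition_value(self, s, s_next, r):
        # Component 1: a TD-style backup over state-state pairs,
        # independent of the action that produced the transition.
        best_next = max((v for (u, _), v in self.T.items() if u == s_next),
                        default=0.0)
        target = r + self.gamma * best_next
        self.T[(s, s_next)] += self.alpha * (target - self.T[(s, s_next)])

    def record_attempt(self, s, a, desired, reached):
        # Component 2: track how reliably each action achieves a desired
        # transition out of state s.
        self.tries[(s, desired, a)] += 1
        if reached == desired:
            self.hits[(s, desired, a)] += 1

    def most_reliable_action(self, s, desired, actions):
        # Return the action with the highest empirical success rate for
        # the transition s -> desired (ties broken arbitrarily).
        def success_rate(a):
            n = self.tries[(s, desired, a)]
            return self.hits[(s, desired, a)] / n if n else 0.0
        return max(actions, key=success_rate)

In this sketch the number of entries in T grows with the number of observed state-state pairs rather than with |A|, and the per-action statistics need only be gathered for transitions already identified as valuable, which is the focusing effect described above.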
Consider the example of Figure 1a. This MDP has two parts, one requiring high skill (which yields a large reward) and one requiring low skill (which yields a small reward). Episodes begin in state 1 and end in states 4, 5, and 6. There are 2n + 1 actions, and the transition table is defined as follows: from state 1, n actions, {a_1, . . . , a_n}, take the agent to state 2 deterministically; n actions, {a_{n+1}, . . . , a_{2n}}, take the agent to state 3 deterministically; and one action, a∗ ≡ a_{2n+1}, takes the agent to either state 2 or state 3 with equal probability. All actions from state 2 take the agent to state 4, ending the episode. From state 3, the 2n ordinary actions move the agent to either state 5 or 6 with equal probability, while action a∗ moves the agent to state

[Figure 1a: transition graph of the example MDP; edges are labeled with transition probabilities (e.g., P(1, {a_j}, 3) = 1 and P(3, {a_j}, 6) = 0.5) and a reward of R = 1.1.]
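The transition structure described above can be written out explicitly. The sketch below encodes only the transitions stated in the text, using hypothetical helper names (build_example_mdp, sample_next_state) and a (state, action) -> outcome-distribution table; rewards and the effect of a∗ from state 3 are deliberately omitted because they are not fully specified above.

import random

def build_example_mdp(n):
    # Transition table for the MDP of Figure 1a as described in the text.
    # States are numbered 1..6; episodes start in state 1 and terminate in
    # states 4, 5, and 6. Actions are indexed 0..2n, with index 2n playing
    # the role of a* (i.e., a_{2n+1}). Only the transitions explicitly
    # described above are encoded; rewards and the effect of a* from
    # state 3 are omitted.
    a_star = 2 * n
    P = {}  # (state, action) -> list of (next_state, probability)
    for a in range(2 * n + 1):
        if a < n:                       # a_1..a_n: to state 2 deterministically
            P[(1, a)] = [(2, 1.0)]
        elif a < 2 * n:                 # a_{n+1}..a_{2n}: to state 3 deterministically
            P[(1, a)] = [(3, 1.0)]
        else:                           # a*: to state 2 or 3 with equal probability
            P[(1, a)] = [(2, 0.5), (3, 0.5)]
        P[(2, a)] = [(4, 1.0)]          # every action from state 2 ends in state 4
        if a != a_star:                 # ordinary actions from state 3: 5 or 6, 50/50
            P[(3, a)] = [(5, 0.5), (6, 0.5)]
    return P

def sample_next_state(P, s, a):
    # Draw a successor from the categorical distribution stored for (s, a).
    outcomes, probs = zip(*P[(s, a)])
    return random.choices(outcomes, weights=probs, k=1)[0]

For n = 1000, state 1 alone offers 2001 actions, so an action-value learner must estimate 2001 separate values there, even though the only distinction that matters is which of two successor states is reached.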
References

[1] Peter Stone et al. Transfer Learning for Reinforcement Learning Domains: A Survey. J. Mach. Learn. Res., 2009.
[2] Ben J. A. Kröse et al. Learning from delayed rewards. Robotics Auton. Syst., 1995.
[3] Doina Precup et al. Learning Options in Reinforcement Learning. SARA, 2002.
[4] Peter Dayan et al. Q-learning. Machine Learning, 1992.
[5] Andrew W. Moore et al. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 2004.
[6] Richard S. Sutton et al. Reinforcement Learning: An Introduction. IEEE Trans. Neural Networks, 1998.
[7] Richard S. Sutton et al. Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bulletin, 1990.
[8] Bram Bakker et al. Hierarchical Reinforcement Learning Based on Subgoal Discovery and Subpolicy Specialization. 2003.
[9] Thomas G. Dietterich. What is machine learning? Archives of Disease in Childhood, 2020.
[10] Andrew G. Barto et al. Building Portable Options: Skill Transfer in Reinforcement Learning. IJCAI, 2007.