Reinforcement learning relies on the association between a goal and a scalar signal, interpreted as a reward or a punishment. The objective is not to reproduce some reference signal, but to progressively find, by trial and error, the policy that maximizes the rewards. This paper presents the basics of reinforcement learning and two model-free algorithms, Q-Learning and Fuzzy Q-Learning.

1 Presentation

The objective of this paper is to present the main lines of reinforcement learning (RL), without claiming to give a complete overview of it; a more thorough survey is in preparation. More modestly, the following three points will be highlighted:
• a general presentation of RL, with a theoretical development for better understanding,
• the description of some algorithms, in order to allow immediate applications,
• the presentation of recent developments on the reciprocal contributions of fuzzy logic and RL.

Contrary to the well-known supervised learning paradigm, which aims to model an input/output mapping, the reinforcement learning approach tries to make behaviors emerge that allow an objective to be reached, with no other information than a scalar signal, the reinforcement. In this type of learning, an "agent" (a very general term that designates here the system under training: a neural network, a fuzzy inference system, a computer program, etc.) permanently analyzes the consequences of its actions, while tending to preferentially replicate those that, in the same circumstances, led to successes.

Two main approaches can be distinguished for solving this type of problem:
• a search in the space of behaviors, to determine those that allow the assigned task to be achieved; this search generally uses Genetic Algorithms [15, 35];
• the use of Dynamic Programming methods, which formalize this type of learning as a Markovian Decision Problem.
These two approaches give comparable results in practice [35], but the second has the advantage of offering a mathematical basis that allows the process to be better understood: it is therefore the one developed here.

The paper is organized as follows.
• Section 2 gives the basic elements of all RL algorithms.
• The formalism of Markovian Decision Problems, presented in the following section, establishes a bridge between reinforcement learning and Dynamic Programming methods [4]. This formalism gives a theoretical basis for modeling the interactions between an agent and its environment. This section can be omitted on a first reading.
• The Temporal Differences (TD) method, which allows a policy to be evaluated incrementally, is presented in section 4. This method, formalized by Sutton [27], is the basis of most RL algorithms.
• Q-Learning, an algorithm based on Temporal Differences and used to determine an optimal policy, is the subject of section 5. It is one of the algorithms for which the theory is most advanced and for which proofs of convergence exist. It does not require knowledge of the transition probabilities from one state to another and is model-free.
• Some practical implementation issues are listed in section 6: the choice of the reinforcement signal, the exploration/exploitation dilemma, the representation of the Q-values, and methods for speeding up learning.
• A "fuzzy" version of Q-Learning is presented in section 7.3. This approach presents several advantages: it allows continuous state and action spaces to be treated, the state-action values to be stored, and a priori knowledge to be introduced.
2 Reinforcement Learning

2.1 General Presentation

Reinforcement learning concerns a family of problems in which an agent evolves while analyzing the consequences of its actions, thanks to a simple scalar signal (the reinforcement) emitted by the environment. This general definition highlights two important features:
• the agent interacts with its environment, and the pair "agent + environment" constitutes a dynamic system;
• the reinforcement signal, generally perceived in terms of reward or punishment, allows the agent to modify its behavior.

In supervised learning, also called "learning with a teacher", the learning system knows at all times the error it commits: for each input vector, the corresponding desired output is known. This difference between the actual output and the reference output can be used to modify the parameters. In reinforcement learning, or "learning with a critic", the received signal is the sanction (positive, negative or neutral) of a behavior: this signal indicates what to do without saying how to do it. The agent uses this signal to determine a policy that allows a long-term objective to be reached. Another difference between the two approaches is that reinforcement learning is fundamentally online, because the agent's actions modify the environment: to accomplish its task, the agent must chain several actions, i.e. follow a policy and, more precisely, determine the policy that will maximize the future rewards.

The general process, schematized in Figure 1, is the following (a minimal code sketch of this loop is given at the end of this subsection):
1. at time step t, the agent is in state x(t);
2. it chooses one of the actions possible in this state, a(t);
3. it applies the action, which causes:
   • the transition to a new state, x(t + 1),
   • the receipt of the reinforcement, r(t);
4. t ← t + 1;
5. go to 2, or stop if the new state is a terminal one.

Let X be the set of states and A the set of actions. The reinforcement r(t) is the consequence of the action a(t) chosen in state x(t). The reinforcement function is thus a mapping from the product space X × A into R (r : X × A → R). For the time being, the spaces X and A are assumed to be discrete. Extensions to continuous state and action spaces will be treated in paragraphs 6.3 and 7.3.
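To make the loop above concrete, here is a minimal sketch in Python of the agent/environment interaction. The `env` and `agent` objects and their methods (`reset`, `step`, `choose_action`, `update`) are hypothetical placeholders chosen for illustration, not an interface defined in the paper.

```python
# Minimal sketch of the general interaction loop (steps 1-5 above).
# The env/agent interfaces (reset, step, choose_action, update) are
# hypothetical placeholders, not an API defined in the paper.

def run_episode(env, agent, max_steps=1000):
    """Run one episode of the agent-environment loop."""
    x = env.reset()                        # step 1: the agent starts in state x(t)
    for t in range(max_steps):
        a = agent.choose_action(x)         # step 2: choose an action a(t) possible in x(t)
        x_next, r, terminal = env.step(a)  # step 3: apply it, observe x(t+1) and r(t)
        agent.update(x, a, r, x_next)      # the agent uses r(t) to improve its policy
        x = x_next                         # step 4: t <- t + 1
        if terminal:                       # step 5: stop when a terminal state is reached
            return t + 1
    return max_steps
```

Any learning rule (such as the Q-Learning update discussed in section 5) would live inside `agent.update`; the loop itself only reflects the generic process described above.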
References

[1] Richard S. Sutton, et al. Learning to predict by the methods of temporal differences. Machine Learning, 1988.
[2] Rémi Munos. Apprentissage par renforcement, étude du cas continu. 1997.
[3] Nicolas Meuleau. Le dilemme entre exploration et exploitation dans l'apprentissage par renforcement : optimisation adaptative des modèles de décision multi-états. 1996.
[4] L. Darrell Whitley, et al. Genetic Reinforcement Learning for Neurocontrol Problems. Machine Learning, 2004.
[5] John N. Tsitsiklis, et al. Asynchronous stochastic approximation and Q-learning. Machine Learning, 1994.
[6] Richard S. Sutton, et al. Learning and Sequential Decision Making. 1989.
[7] Abdollah Homaifar, et al. Simultaneous design of membership functions and rule sets for fuzzy controllers using genetic algorithms. IEEE Transactions on Fuzzy Systems, 1995.
[8] P. Glorennec, et al. Fuzzy Q-learning. Proceedings of the 6th International Fuzzy Systems Conference, 1997.
[9] Anton Schwartz, et al. A Reinforcement Learning Method for Maximizing Undiscounted Rewards. ICML, 1993.
[10] Steven D. Whitehead, et al. A Complexity Analysis of Cooperative Mechanisms in Reinforcement Learning. AAAI, 1991.
[11] S. W. Piche, et al. Steepest descent algorithms for neural network controllers and filters. IEEE Transactions on Neural Networks, 1994.
[12] Jeremy Wyatt, et al. Exploration and inference in learning from reinforcement. 1998.
[13] Long Ji Lin, et al. Self-improvement Based on Reinforcement Learning, Planning and Teaching. ML, 1991.
[14] Lionel Jouffe, et al. Fuzzy inference system learning by reinforcement methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 1998.
[15] Hyung Suck Cho, et al. A sensor-based navigation for a mobile robot using fuzzy logic and reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, 1995.
[16] Tucker Balch, et al. Learning Roles: Behavioral Diversity in Robot Teams. 1997.
[17] Paul E. Utgoff, et al. A Teaching Method for Reinforcement Learning. ML, 1992.
[18] Andrew McCallum, et al. Using Transitional Proximity for Faster Reinforcement Learning. ML, 1992.
[19] Sebastian Thrun, et al. Active Exploration in Dynamic Environments. NIPS, 1991.
[20] Claude F. Touzet, et al. Neural reinforcement learning for behaviour synthesis. Robotics and Autonomous Systems, 1997.
[21] Richard S. Sutton, et al. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 1983.
[22] Chris Watkins, et al. Learning from delayed rewards. 1989.
[23] Richard S. Sutton, et al. Introduction to Reinforcement Learning. 1998.
[24] P. Y. Glorennec, et al. Fuzzy Q-learning and dynamical fuzzy Q-learning. Proceedings of the 1994 IEEE 3rd International Fuzzy Systems Conference, 1994.
[25] Jing Peng, et al. Incremental multi-step Q-learning. Machine Learning, 1994.
[26] Mahesan Niranjan, et al. On-line Q-learning using connectionist systems. 1994.