Open Theoretical Questions in Reinforcement Learning

Reinforcement learning (RL) concerns the problem of a learning agent interacting with its environment to achieve a goal. Instead of being given examples of desired behavior, the learning agent must discover by trial and error how to behave in order to get the most reward. The environment is a Markov decision process (MDP) with state set \( \mathcal{S} \) and action set \( \mathcal{A} \). The agent and the environment interact in a sequence of discrete time steps, \( t = 0, 1, 2, \ldots \). The state and action at one time step, \( s_t \in \mathcal{S} \) and \( a_t \in \mathcal{A} \), determine the probability distribution of the next state, \( s_{t+1} \in \mathcal{S} \), and, jointly, the distribution of the next reward, \( r_{t+1} \in \Re \). The agent's objective is to choose each action \( a_t \) so as to maximize the subsequent return: $$ R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+1+k}, $$ where the discount rate, \( 0 \le \gamma \le 1 \), determines the relative weighting of immediate and delayed rewards. In some environments the interaction consists of a sequence of episodes, each starting in a given state and ending upon arrival in a terminal state, which terminates the sum above. In other cases the interaction is continual, without interruption, and the sum may have an infinite number of terms (in which case we usually assume \( \gamma < 1 \)). Infinite-horizon cases with \( \gamma = 1 \) are also possible, though less common (e.g., see Mahadevan, 1996).
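To make the episodic formulation concrete, here is a minimal Python sketch of the agent-environment loop and the computation of the return \( R_t \). The `env` and `agent` objects and their methods (`reset`, `step`, `act`, `learn`) are hypothetical interfaces introduced only for illustration; they are not part of the paper or any particular library.

```python
# Minimal sketch of the agent-environment interaction described above.
# `env` and `agent` are hypothetical objects, assumed to provide
# reset/step and act/learn methods respectively.

def discounted_return(rewards, gamma):
    """Compute R_t = sum_k gamma^k * r_{t+1+k} for a finite episode,
    given rewards = [r_{t+1}, r_{t+2}, ...]."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def run_episode(env, agent, gamma=0.9):
    """Run one episode and return the discounted return from the start state."""
    s = env.reset()                     # initial state s_0
    rewards = []
    done = False
    while not done:
        a = agent.act(s)                # choose a_t given s_t
        s_next, r, done = env.step(a)   # environment emits s_{t+1}, r_{t+1}
        agent.learn(s, a, r, s_next)    # e.g., a TD or Q-learning update
        rewards.append(r)
        s = s_next
    return discounted_return(rewards, gamma)
```

In the continuing (non-episodic) case the loop never terminates, which is why \( \gamma < 1 \) is usually assumed there: it keeps the infinite sum finite.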

[1] David Elkind, et al. Learning: An Introduction, 1968.

[2] Richard S. Sutton, et al. Temporal Credit Assignment in Reinforcement Learning, 1984.

[3] C. Watkins. Learning from Delayed Rewards, 1989.

[4] C. Atkeson, et al. Prioritized Sweeping: Reinforcement Learning with Less Data and Less Time, Machine Learning, 1993.

[5] Satinder Singh, et al. Learning to Solve Markovian Decision Processes, 1993.

[6] Gerald Tesauro, et al. Temporal Difference Learning and TD-Gammon, CACM, 1995.

[7] Andrew G. Barto, et al. Improving Elevator Performance Using Reinforcement Learning, NIPS, 1995.

[8] Richard S. Sutton, et al. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding, NIPS, 1995.

[9] Leemon C. Baird, et al. Residual Algorithms: Reinforcement Learning with Function Approximation, ICML, 1995.

[10] Richard S. Sutton, et al. Reinforcement Learning with Replacing Eligibility Traces, Machine Learning, 1996.

[11] John N. Tsitsiklis, et al. Neuro-Dynamic Programming, Athena Scientific, 1996.

[12] John N. Tsitsiklis, et al. Analysis of Temporal-Difference Learning with Function Approximation, NIPS, 1996.

[13] Dimitri P. Bertsekas, et al. Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems, NIPS, 1996.

[14] John Loch, et al. Using Eligibility Traces to Find the Best Memoryless Policy in Partially Observable Markov Decision Processes, ICML, 1998.

[15] Richard S. Sutton, et al. Introduction to Reinforcement Learning, 1998.

[16] Sridhar Mahadevan, et al. Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results, Machine Learning, 1996.