Unifying n-Step Temporal-Difference Action-Value Methods

Unifying seemingly disparate algorithmic ideas to produce better-performing algorithms has been a longstanding goal in reinforcement learning. A primary example is the TD(λ) algorithm, which elegantly unifies temporal-difference (TD) methods with Monte Carlo methods through eligibility traces and the trace-decay parameter λ. The same type of unification is achievable with n-step algorithms, a simpler class of multi-step TD methods in which each update consists of a single backup of length n rather than a geometric average of backups of different lengths. In this work, we present a new n-step algorithm, Q(σ), that unifies two of the existing n-step algorithms for estimating action-value functions: Sarsa and Tree Backup. The fundamental difference between Sarsa and Tree Backup is that the former samples a single action at every step of the backup, whereas the latter takes an expectation over all possible actions. We introduce a new parameter, σ ∈ [0, 1], that allows the degree of sampling performed by the algorithm at each step to be varied continuously. This creates a new family of algorithms that spans a continuum between Sarsa (full sampling, σ = 1) and Tree Backup (pure expectation, σ = 0). Our results show that the algorithm can perform better with intermediate values of σ than with either extreme. Moreover, if we decay σ over time from one to zero, we obtain an algorithm that outperforms the fixed-σ variants of Q(σ) across a variety of tasks.

This work has three main contributions. First, we introduce our new algorithm, n-step Q(σ), and provide empirical evaluations of it in the tabular case. Second, we extend n-step Q(σ) to the linear function approximation case and demonstrate its performance in the mountain cliff environment. Third, we combine n-step Q(σ) with the DQN architecture and test the performance of the resulting architecture, named the Q(σ) network, in the mountain car environment.

Throughout our empirical evaluations, we found that the parameter σ often serves as a trade-off between initial and final performance. Moreover, the decaying-σ algorithm performed better than algorithms with fixed values of σ in terms of both initial and final performance, and in some domains n-step Q(σ) with an intermediate value of σ performed better than either of the extreme values corresponding to n-step Tree Backup and Sarsa. Our results represent a compelling argument for using n-step Q(σ) over n-step Sarsa or Tree Backup: n-step Q(σ) offers a flexible framework that can be adapted to the specifics of the learning task in order to improve performance.
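To make the mechanism concrete, the following Python sketch computes an on-policy n-step Q(σ) target as a per-step mixture of the Sarsa and Tree Backup returns. It is a minimal illustration rather than code from the thesis: the function name and array layout are assumptions, Q and pi are taken to be tabular NumPy arrays indexed by integer states and actions, and the off-policy corrections via importance-sampling ratios are omitted.

import numpy as np

def n_step_q_sigma_target(rewards, states, actions, Q, pi, sigma, gamma=0.99):
    # On-policy n-step Q(sigma) target for one stored segment of experience.
    # rewards[k] holds R_{t+k+1} (length n); states[k] and actions[k] hold
    # S_{t+k} and A_{t+k} (length n+1). Q is a (num_states, num_actions)
    # array of action values, pi[s] gives the target-policy probabilities in
    # state s, and sigma in [0, 1] sets the degree of sampling
    # (1 = Sarsa, 0 = Tree Backup).
    n = len(rewards)
    G = Q[states[n], actions[n]]            # bootstrap from the final state-action pair
    for k in range(n - 1, -1, -1):          # build the return backwards through the segment
        s, a = states[k + 1], actions[k + 1]
        expected_q = np.dot(pi[s], Q[s])    # expectation over actions under pi
        sample_part = G                     # Sarsa-style: continue through the sampled action
        expected_part = expected_q - pi[s][a] * Q[s, a] + pi[s][a] * G  # Tree-Backup-style: expectation, continuing only through the taken action
        G = rewards[k] + gamma * (sigma * sample_part + (1 - sigma) * expected_part)
    return G

Under these assumptions, the tabular update would then be Q[states[0], actions[0]] += alpha * (G - Q[states[0], actions[0]]), and the decaying-σ variant discussed above simply anneals sigma from one toward zero as learning progresses.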
