Unifying n-Step Temporal-Difference Action-Value Methods

Unifying seemingly disparate algorithmic ideas to produce better-performing algorithms has been a longstanding goal in reinforcement learning. A primary example is the TD(λ) algorithm, which elegantly unifies temporal-difference (TD) methods with Monte Carlo methods through eligibility traces and the trace-decay parameter λ. The same type of unification is achievable with n-step algorithms, a simpler class of multi-step TD methods in which each update consists of a single backup of length n rather than a geometric average of backups of different lengths. In this work, we present a new n-step algorithm, Q(σ), that unifies two of the existing n-step algorithms for estimating action-value functions: Sarsa and Tree Backup. The fundamental difference between Sarsa and Tree Backup is that the former samples a single action at every step of the backup, whereas the latter takes an expectation over all possible actions. We introduce a new parameter, σ ∈ [0, 1], that allows the degree of sampling performed by the algorithm at each step to be varied continuously. This creates a new family of algorithms that spans a continuum between Sarsa (full sampling, σ = 1) and Tree Backup (pure expectation, σ = 0). Our results show that the algorithm can perform better with intermediate values of σ than with either extreme. Moreover, if we decay σ over time from one to zero, we obtain an algorithm that outperforms the fixed-σ variants of Q(σ) across a variety of tasks.

This work has three main contributions. First, we introduce our new algorithm, n-step Q(σ), and provide empirical evaluations of it in the tabular case. Second, we extend n-step Q(σ) to the linear function approximation case and demonstrate its performance in the mountain cliff environment. Third, we combine n-step Q(σ) with the DQN architecture and test the performance of the resulting architecture, named the Q(σ) network, in the mountain car environment.

Throughout our empirical evaluations, we found that the parameter σ often serves as a trade-off between initial and final performance. Moreover, the decaying-σ algorithm performed better than algorithms with fixed values of σ in terms of both initial and final performance, and in some domains n-step Q(σ) with an intermediate value of σ performed better than either of the extreme values corresponding to n-step Tree Backup and Sarsa. Our results represent a compelling argument for using n-step Q(σ) over n-step Sarsa or Tree Backup: n-step Q(σ) offers a flexible framework that can be adapted to the specifics of the learning task in order to improve performance.
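To make the mechanism concrete, the following Python sketch computes an on-policy n-step Q(σ) target as a per-step mixture of the Sarsa and Tree Backup returns. It is a minimal illustration rather than code from the thesis: the function name and array layout are assumptions, Q and pi are taken to be tabular NumPy arrays indexed by integer states and actions, and the off-policy corrections via importance-sampling ratios are omitted.

import numpy as np

def n_step_q_sigma_target(rewards, states, actions, Q, pi, sigma, gamma=0.99):
    # On-policy n-step Q(sigma) target for one stored segment of experience.
    # rewards[k] holds R_{t+k+1} (length n); states[k] and actions[k] hold
    # S_{t+k} and A_{t+k} (length n+1). Q is a (num_states, num_actions)
    # array of action values, pi[s] gives the target-policy probabilities in
    # state s, and sigma in [0, 1] sets the degree of sampling
    # (1 = Sarsa, 0 = Tree Backup).
    n = len(rewards)
    G = Q[states[n], actions[n]]            # bootstrap from the final state-action pair
    for k in range(n - 1, -1, -1):          # build the return backwards through the segment
        s, a = states[k + 1], actions[k + 1]
        expected_q = np.dot(pi[s], Q[s])    # expectation over actions under pi
        sample_part = G                     # Sarsa-style: continue through the sampled action
        expected_part = expected_q - pi[s][a] * Q[s, a] + pi[s][a] * G  # Tree-Backup-style: expectation, continuing only through the taken action
        G = rewards[k] + gamma * (sigma * sample_part + (1 - sigma) * expected_part)
    return G

Under these assumptions, the tabular update would then be Q[states[0], actions[0]] += alpha * (G - Q[states[0], actions[0]]), and the decaying-σ variant discussed above simply anneals sigma from one toward zero as learning progresses.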
