TBQ(σ): Improving Efficiency of Trace Utilization for Off-Policy Reinforcement Learning

Off-policy reinforcement learning with eligibility traces is challenging because of the discrepancy between the target policy and the behavior policy. One common approach is to measure the difference between the two policies probabilistically, as in importance sampling and tree-backup. However, existing off-policy learning methods based on probabilistic policy measurement utilize traces inefficiently under a greedy target policy, which makes them ineffective for control problems: the traces are cut immediately once a non-greedy action is taken, which forfeits the advantage of eligibility traces and slows down the learning process. Alternatively, non-probabilistic measurement methods such as General Q(λ) and Naive Q(λ) never cut traces, but they face convergence problems in practice. To address these issues, this paper introduces a new method named TBQ(σ), which effectively unifies the tree-backup algorithm and Naive Q(λ). By introducing a new parameter σ to control the degree of trace utilization, TBQ(σ) integrates TB(λ) and Naive Q(λ) and allows a continuous role shift between them. The contraction property of TBQ(σ) is analyzed theoretically for both the policy evaluation and control settings. We also derive the online version of TBQ(σ) and give its convergence proof. We show empirically that, for ε ∈ (0, 1] in ε-greedy policies, there exists some degree of trace utilization, with λ ∈ [0, 1], that improves the efficiency of trace utilization for off-policy reinforcement learning, both accelerating the learning process and improving performance.
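
To make the interpolation concrete, the sketch below shows one plausible way a σ-controlled trace-decay coefficient could blend TB(λ) and Naive Q(λ). The specific coefficient form λ[(1 − σ)π(a|s) + σ], the tabular backward-view loop, and the names trace_decay and tbq_sigma_episode_update are illustrative assumptions, not the authors' exact formulation: with σ = 0 the decay reduces to the TB(λ) factor γλπ(a|s), and with σ = 1 it reduces to the Naive Q(λ) factor γλ.

import numpy as np

def trace_decay(sigma, lam, gamma, pi_prob):
    # Hypothetical per-step trace-decay factor interpolating between
    # TB(lambda), which scales traces by the target-policy probability
    # pi(a_t | s_t) (and so cuts them for non-greedy actions under a
    # greedy target policy), and Naive Q(lambda), which keeps the full
    # lambda regardless of the action taken.
    # sigma = 0 recovers TB(lambda); sigma = 1 recovers Naive Q(lambda).
    return gamma * lam * ((1.0 - sigma) * pi_prob + sigma)

def tbq_sigma_episode_update(Q, episode, pi, alpha, gamma, lam, sigma):
    # Backward-view tabular sketch. Q is an (nS, nA) array, `episode` is
    # a list of (s, a, r, s_next) transitions, and pi(s, a) returns the
    # target-policy probability of action a in state s. Illustrative
    # only; not the authors' exact algorithm.
    e = np.zeros_like(Q)                               # eligibility traces
    for s, a, r, s_next in episode:
        delta = r + gamma * Q[s_next].max() - Q[s, a]  # greedy (control) target
        e[s, a] += 1.0                                 # accumulating trace
        Q = Q + alpha * delta * e
        e = e * trace_decay(sigma, lam, gamma, pi(s, a))  # decay instead of hard cut
    return Q

Under this sketch, σ = 0 cuts traces exactly as TB(λ) does whenever the greedy target policy assigns zero probability to the taken action, σ = 1 never cuts them, and intermediate values give the continuous role shift between the two algorithms that the abstract describes.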
