Actual Return Reinforcement Learning versus Temporal Differences: Some Theoretical and Experimental Results

This paper argues that, for many domains, credit-assignment methods that use actual returns can be expected to be more effective for reinforcement learning than the more commonly used temporal difference methods. We present analysis and empirical evidence from three sets of experiments in different domains to support this claim. We introduce a new algorithm called C-Trace, a variant of the P-Trace RL algorithm, and discuss some possible advantages of using algorithms of this type.
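To make the contrast concrete, the following is a minimal sketch of the two kinds of learning target, assuming a single finished episode, a discount factor `gamma`, and a tabular value estimate. It is not the paper's C-Trace or P-Trace algorithm; the function names and the toy episode are hypothetical and chosen only for illustration.

```python
from typing import List


def actual_returns(rewards: List[float], gamma: float) -> List[float]:
    """Discounted actual return G_t for each step of one finished episode.

    G_t = r_t + gamma * G_{t+1}, computed backwards from the final step.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def td0_targets(rewards: List[float], next_values: List[float], gamma: float) -> List[float]:
    """One-step temporal difference targets r_t + gamma * V(s_{t+1}).

    next_values[t] is the current estimate V(s_{t+1}); use 0.0 for a terminal state.
    """
    return [r + gamma * v for r, v in zip(rewards, next_values)]


# Hypothetical three-step episode with a single reward at the end.
rewards = [0.0, 0.0, 1.0]
gamma = 0.5

# Value estimates start at zero, so the bootstrapped targets carry no
# information about the final reward for the earlier steps.
next_values = [0.0, 0.0, 0.0]

mc = actual_returns(rewards, gamma)
td = td0_targets(rewards, next_values, gamma)
print("actual returns:", mc)
print("TD(0) targets: ", td)
```

With zero-initialised estimates, the actual-return targets credit every step of the episode after one pass, while the one-step temporal difference targets only credit the final step and need further episodes to propagate the reward backwards.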
