Learning optimal values from random walk

In this paper we extend the random walk example of Sutton and Barto (1998) to a multistage dynamic programming optimization setting with discounted rewards. Using the Bellman equations for the presumed action, we derive the optimal values for general transition probability rho and discount rate gamma; these include the original random walk as a special case. Temporal difference methods with eligibility traces, TD(lambda), are effective in predicting the optimal values for different rho and gamma, but their performance is found to depend critically on the choice of truncated return in the formulation when gamma is less than 1.
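
As a rough sketch of the setup summarized above (not the paper's exact formulation), the following Python example solves the Bellman equations for a small random walk with right-step probability rho and discount gamma, and then estimates those values with TD(lambda) prediction using accumulating eligibility traces. The five-state layout, the +1 reward at the right terminal only, the function names, and all parameter values are assumptions chosen for illustration; with rho = 0.5 and gamma = 1 it reduces to the original Sutton and Barto random walk, whose true values are 1/6, ..., 5/6.

```python
import numpy as np

def true_values(n_states=5, rho=0.5, gamma=1.0):
    """Solve the Bellman equations v = r + gamma * P v for the random walk.

    Assumed setup: n_states non-terminal states in a row, step right with
    probability rho, left with 1 - rho; reward +1 on entering the right
    terminal state, 0 otherwise; terminal values are 0.
    """
    A = np.eye(n_states)
    b = np.zeros(n_states)
    for s in range(n_states):
        if s + 1 < n_states:
            A[s, s + 1] -= gamma * rho          # right step to a non-terminal
        else:
            b[s] += rho * 1.0                   # right step ends with reward +1
        if s - 1 >= 0:
            A[s, s - 1] -= gamma * (1 - rho)    # left step to a non-terminal
        # left terminal: reward 0 and value 0, so nothing to add
    return np.linalg.solve(A, b)

def td_lambda(n_states=5, rho=0.5, gamma=1.0, lam=0.8,
              alpha=0.05, episodes=2000, seed=0):
    """TD(lambda) prediction with accumulating eligibility traces."""
    rng = np.random.default_rng(seed)
    v = np.zeros(n_states)
    for _ in range(episodes):
        e = np.zeros(n_states)                  # eligibility traces
        s = n_states // 2                       # start in the middle state
        while True:
            step = 1 if rng.random() < rho else -1
            s_next = s + step
            if s_next == n_states:              # right terminal: reward +1
                r, v_next, done = 1.0, 0.0, True
            elif s_next == -1:                  # left terminal: reward 0
                r, v_next, done = 0.0, 0.0, True
            else:
                r, v_next, done = 0.0, v[s_next], False
            delta = r + gamma * v_next - v[s]   # TD error
            e[s] += 1.0                         # accumulating trace
            v += alpha * delta * e
            e *= gamma * lam
            if done:
                break
            s = s_next
    return v

if __name__ == "__main__":
    print("Bellman solution:", true_values(rho=0.5, gamma=1.0))  # approx. 1/6 ... 5/6
    print("TD(lambda):      ", td_lambda(rho=0.5, gamma=1.0))
```

Varying rho, gamma, and lam in the sketch gives a feel for how the discount and transition probability reshape the target values that the TD(lambda) predictor must track.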