An upper bound on the loss from approximate optimal-value functions
[1] Evan L. Porteus. Some Bounds for Discounted Sequential Decision Processes, 1971.
[2] Abraham Charnes, et al. Information Requirements for Urban Systems: A View into the Possible Future?, 1972.
[3] Richard S. Sutton, et al. Neuronlike adaptive elements that can solve difficult learning control problems, 1983, IEEE Transactions on Systems, Man, and Cybernetics.
[4] Charles W. Anderson, et al. Learning and problem-solving with multilayer connectionist systems (adaptive, strategy learning, neural networks, reinforcement learning), 1986.
[5] Dimitri P. Bertsekas, et al. Dynamic Programming: Deterministic and Stochastic Models, 1987.
[6] Paul J. Werbos, et al. Building and Understanding Adaptive Systems: A Statistical/Numerical Approach to Factory Automation and Brain Research, 1987, IEEE Transactions on Systems, Man, and Cybernetics.
[7] C. Watkins. Learning from delayed rewards, 1989.
[8] A. Barto, et al. Learning and Sequential Decision Making, 1989.
[9] Richard S. Sutton, et al. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming, 1990, ML.
[10] Steven J. Bradtke, et al. Reinforcement Learning Applied to Linear Quadratic Regulation, 1992, NIPS.
[11] Ronald J. Williams, et al. Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions, 1993.
[12] Ronald J. Williams, et al. Analysis of Some Incremental Variants of Policy Iteration: First Steps Toward Understanding Actor-Critic Learning Systems, 1993.
[13] Thomas G. Dietterich. What is machine learning?, 2020, Archives of Disease in Childhood.
[14] Peter Dayan, et al. Q-learning, 1992, Machine Learning.
[15] Gerald Tesauro, et al. Practical issues in temporal difference learning, 1992, Machine Learning.
[16] Peter Dayan, et al. Technical Note: Q-Learning, 1992, Machine Learning.
[17] Richard S. Sutton, et al. Learning to predict by the methods of temporal differences, 1988, Machine Learning.