Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms
Satinder P. Singh | Tommi S. Jaakkola | Michael L. Littman | Csaba Szepesvári
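For orientation before the bibliography: the single-step on-policy algorithm whose convergence the paper analyzes is SARSA(0). Below is a minimal tabular sketch in Python, not the paper's own pseudocode; the environment interface (reset(), step(), env.actions) is an illustrative assumption, and the fixed step size and exploration rate are simplifications, since the convergence theorems require appropriately decaying learning rates and GLIE-style exploration.

```python
import random
from collections import defaultdict

def sarsa(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular SARSA(0): one-step on-policy temporal-difference control.

    `env` is assumed to expose reset() -> state, step(action) ->
    (next_state, reward, done), and a finite list `env.actions`;
    these names are illustrative, not from the paper.
    """
    Q = defaultdict(float)  # Q[(state, action)], zero-initialized

    def epsilon_greedy(state):
        # The convergence results assume exploration that becomes greedy
        # in the limit; a fixed epsilon keeps this sketch short.
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        state = env.reset()
        action = epsilon_greedy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(next_state)
            # On-policy backup: bootstrap from the action the behavior
            # policy actually takes next, unlike Q-learning's max.
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```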
[1] Pravin Varaiya, et al. Stochastic Systems: Estimation, Identification, and Adaptive Control, 1986.
[2] C. Watkins. Learning from Delayed Rewards. Ph.D. thesis, University of Cambridge, 1989.
[3] Sebastian Thrun. The role of exploration in learning control. In Handbook of Intelligent Control, 1992.
[4] Donald A. Sofge, et al. Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, 1992.
[5] Ronald J. Williams, et al. Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions, 1993.
[6] Andrew G. Barto, et al. Convergence of Indirect Adaptive Asynchronous Value Iteration Algorithms, 1993, NIPS.
[7] Andrew W. Moore, et al. Generalization in Reinforcement Learning: Safely Approximating the Value Function, 1994, NIPS.
[8] Michael I. Jordan, et al. Technical report, MIT Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Department of Brain and Cognitive Sciences, 1996.
[9] George H. John. When the Best Move Isn't Optimal: Q-learning with Exploration, 1994, AAAI.
[10] Mahesan Niranjan, et al. On-line Q-learning using connectionist systems, 1994.
[11] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.
[12] Gerald Tesauro. Temporal Difference Learning and TD-Gammon, 1995, J. Int. Comput. Games Assoc.
[13] Thomas G. Dietterich, et al. High-Performance Job-Shop Scheduling With A Time-Delay TD(λ) Network, 1995, NIPS.
[14] Andrew G. Barto, et al. Improving Elevator Performance Using Reinforcement Learning, 1995, NIPS.
[15] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Two Volume Set, 1995.
[16] Gavin Adrian Rummery. Problem solving with reinforcement learning. Ph.D. thesis, University of Cambridge, 1995.
[17] Richard S. Sutton. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding, 1995, NIPS.
[18] Andrew G. Barto, et al. Learning to Act Using Real-Time Dynamic Programming, 1995, Artif. Intell.
[19] Andrew W. Moore, et al. Reinforcement Learning: A Survey, 1996, J. Artif. Intell. Res.
[20] Csaba Szepesvári, et al. A Generalized Reinforcement-Learning Model: Convergence and Applications, 1996, ICML.
[21] Richard S. Sutton, et al. Reinforcement Learning with Replacing Eligibility Traces, 1996, Machine Learning.
[22] Michael L. Littman. Algorithms for Sequential Decision Making, 1996.
[23] John N. Tsitsiklis, et al. Analysis of temporal-difference learning with function approximation, 1996, NIPS.
[24] Dimitri P. Bertsekas, et al. Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems, 1996, NIPS.
[25] Csaba Szepesvári, et al. Generalized Markov Decision Processes: Dynamic-programming and Reinforcement-learning Algorithms, 1996.
[26] Richard S. Sutton, et al. Introduction to Reinforcement Learning, 1998.
[27] V. Borkar. Asynchronous Stochastic Approximations, 1998.
[28] Peter Dayan, et al. Q-learning, 1992, Machine Learning.
[29] Terrence J. Sejnowski, et al. Exploration Bonuses and Dual Control, 1996, Machine Learning.
[30] Peter Dayan. The convergence of TD(λ) for general λ, 1992, Machine Learning.
[31] John N. Tsitsiklis. Asynchronous Stochastic Approximation and Q-Learning, 1994, Machine Learning.
[32] Terrence J. Sejnowski, et al. TD(λ) Converges with Probability 1, 1994, Machine Learning.
[33] Satinder Singh, et al. An upper bound on the loss from approximate optimal-value functions, 1994, Machine Learning.
[34] Richard S. Sutton. Learning to predict by the methods of temporal differences, 1988, Machine Learning.