Algorithms for Reinforcement Learning
[1] Ronald A. Howard,et al. Dynamic Programming and Markov Processes , 1960 .
[2] J. Albus. A Theory of Cerebellar Function , 1971 .
[3] James S. Albus,et al. Brains, behavior, and robotics , 1981 .
[4] Richard S. Sutton,et al. Neuronlike adaptive elements that can solve difficult learning control problems , 1983, IEEE Transactions on Systems, Man, and Cybernetics.
[5] Richard S. Sutton,et al. Temporal credit assignment in reinforcement learning , 1984 .
[6] S. Thomas Alexander,et al. Adaptive Signal Processing , 1986, Texts and Monographs in Computer Science.
[7] C. Watkins. Learning from delayed rewards , 1989 .
[8] Christian M. Ernst,et al. Multi-armed Bandit Allocation Indices , 1989 .
[9] John N. Tsitsiklis,et al. The complexity of dynamic programming , 1989, J. Complex..
[10] Michael I. Jordan,et al. Advances in Neural Information Processing Systems 30 , 1995 .
[11] J. Bather,et al. Multi‐Armed Bandit Allocation Indices , 1990 .
[12] Peter W. Glynn,et al. Likelihood ratio gradient estimation for stochastic systems , 1990, CACM.
[13] Boris Polyak,et al. Acceleration of stochastic approximation by averaging , 1992 .
[14] W. Härdle. Applied Nonparametric Regression , 1992 .
[15] Pavel Brazdil,et al. Proceedings of the European Conference on Machine Learning , 1993 .
[16] Michael L. Littman,et al. Packet Routing in Dynamically Changing Networks: A Reinforcement Learning Approach , 1993, NIPS.
[17] Andrew W. Moore,et al. Generalization in Reinforcement Learning: Safely Approximating the Value Function , 1994, NIPS.
[18] Gerald Tesauro,et al. TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play , 1994, Neural Computation.
[19] Michael I. Jordan,et al. Massachusetts Institute of Technology Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Department of Brain and Cognitive Sciences , 1996 .
[20] Mahesan Niranjan,et al. On-line Q-learning using connectionist systems , 1994 .
[21] Michael L. Littman,et al. Markov Games as a Framework for Multi-Agent Reinforcement Learning , 1994, ICML.
[22] Martin L. Puterman,et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .
[23] John Rust. Using Randomization to Break the Curse of Dimensionality , 1997 .
[24] Michael I. Jordan,et al. Reinforcement Learning with Soft State Aggregation , 1994, NIPS.
[25] Matthias Heger,et al. Consideration of Risk in Reinforcement Learning , 1994, ICML.
[26] Geoffrey J. Gordon. Stable Function Approximation in Dynamic Programming , 1995, ICML.
[27] Andrew G. Barto,et al. Improving Elevator Performance Using Reinforcement Learning , 1995, NIPS.
[28] Dimitri P. Bertsekas,et al. Dynamic Programming and Optimal Control, Two Volume Set , 1995 .
[29] Steven J. Bradtke,et al. Incremental dynamic programming for on-line adaptive optimal control , 1995 .
[30] Gavin Adrian Rummery. Problem solving with reinforcement learning , 1995 .
[31] Wei Zhang,et al. A Reinforcement Learning Approach to job-shop Scheduling , 1995, IJCAI.
[32] Leemon C. Baird,et al. Residual Algorithms: Reinforcement Learning with Function Approximation , 1995, ICML.
[33] Andrew W. Moore,et al. Reinforcement Learning: A Survey , 1996, J. Artif. Intell. Res..
[34] Richard S. Sutton,et al. Reinforcement Learning with Replacing Eligibility Traces , 2005, Machine Learning.
[35] Piotr Berman,et al. On-line Searching and Navigation , 1996, Online Algorithms.
[36] John N. Tsitsiklis,et al. Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.
[37] S. Ioffe,et al. Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming , 1996 .
[38] Thomas G. Dietterich. What is machine learning? , 2020, Archives of Disease in Childhood.
[39] John N. Tsitsiklis,et al. Analysis of Temporal-Difference Learning with Function Approximation , 1996, NIPS.
[40] Dimitri P. Bertsekas,et al. Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems , 1996, NIPS.
[41] David K. Smith,et al. Dynamic Programming and Optimal Control. Volume 1 , 1996 .
[42] Csaba Szepesvári,et al. Learning and Exploitation Do Not Conflict Under Minimax Optimality , 1997, ECML.
[43] Dimitri P. Bertsekas,et al. Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming , 1997 .
[44] V. Borkar. Stochastic approximation with two time scales , 1997 .
[45] Csaba Szepesvári,et al. The Asymptotic Convergence-Rate of Q-learning , 1997, NIPS.
[46] Shun-ichi Amari,et al. Natural Gradient Works Efficiently in Learning , 1998, Neural Computation.
[47] Csaba Szepesvári. Static and Dynamic Aspects of Optimal Sequential Decision Making , 1998 .
[48] Thomas G. Dietterich. The MAXQ Method for Hierarchical Reinforcement Learning , 1998, ICML.
[49] Stuart J. Russell,et al. Bayesian Q-Learning , 1998, AAAI/IAAI.
[50] V. Borkar. Asynchronous Stochastic Approximations , 1998 .
[51] Doina Precup,et al. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning , 1999, Artif. Intell..
[52] Carlos Domingo,et al. Faster Near-Optimal Reinforcement Learning: Adding Adaptiveness to the E3 Algorithm , 1999, ALT.
[53] John N. Tsitsiklis,et al. Average cost temporal-difference learning , 1997, Proceedings of the 36th IEEE Conference on Decision and Control.
[54] David Andre,et al. Model based Bayesian Exploration , 1999, UAI.
[55] John N. Tsitsiklis,et al. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives , 1999, IEEE Trans. Autom. Control..
[56] Yishay Mansour,et al. Approximate Planning in Large POMDPs via Reusable Trajectories , 1999, NIPS.
[57] John N. Tsitsiklis,et al. Actor-Critic Algorithms , 1999, NIPS.
[58] Csaba Szepesvári,et al. A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms , 1999, Neural Computation.
[59] Nicol N. Schraudolph,et al. Local Gain Adaptation in Stochastic Gradient Descent , 1999 .
[60] Yishay Mansour,et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.
[61] J. W. Nieuwenhuis,et al. Book review of D.P. Bertsekas (ed.), Dynamic Programming and Optimal Control - Volume 2 , 1999 .
[62] Malcolm J. A. Strens,et al. A Bayesian Framework for Reinforcement Learning , 2000, ICML.
[63] Michael I. Jordan,et al. PEGASUS: A policy search method for large MDPs and POMDPs , 2000, UAI.
[64] John N. Tsitsiklis,et al. Regression methods for pricing complex American-style options , 2001, IEEE Trans. Neural Networks.
[65] Richard S. Sutton,et al. Predictive Representations of State , 2001, NIPS.
[66] Csaba Szepesvári,et al. Efficient approximate planning in continuous space Markovian Decision Problems , 2001, AI Commun..
[67] Yishay Mansour,et al. Learning Rates for Q-learning , 2004, J. Mach. Learn. Res..
[68] Sham M. Kakade,et al. A Natural Policy Gradient , 2001, NIPS.
[69] Marcus Hutter,et al. Towards a Universal Theory of Artificial Intelligence Based on Algorithmic Probability and Sequential Decisions , 2000, ECML.
[70] Shie Mannor,et al. PAC Bounds for Multi-armed Bandit and Markov Decision Processes , 2002, COLT.
[71] John Langford,et al. Approximately Optimal Approximate Reinforcement Learning , 2002, ICML.
[72] Koby Crammer,et al. Advances in Neural Information Processing Systems 14 , 2002 .
[73] H. He,et al. Efficient Reinforcement Learning Using Recursive Least-Squares Methods , 2011, J. Artif. Intell. Res..
[74] Ronen I. Brafman,et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning , 2001, J. Mach. Learn. Res..
[75] Adam Krzyzak,et al. A Distribution-Free Theory of Nonparametric Regression , 2002, Springer series in statistics.
[76] Doina Precup,et al. A Convergent Form of Approximate Policy Iteration , 2002, NIPS.
[77] Sean P. Meyn,et al. Risk-Sensitive Optimal Control for Markov Decision Processes with Monotone Cost , 2002, Math. Oper. Res..
[78] G. Wahba. REPRODUCING KERNEL HILBERT SPACES - TWO BRIEF REVIEWS , 2003 .
[79] Carl E. Rasmussen,et al. Gaussian Processes in Reinforcement Learning , 2003, NIPS.
[80] Stefan Schaal,et al. Reinforcement Learning for Humanoid Robotics , 2003 .
[81] Jeff G. Schneider,et al. Covariant Policy Search , 2003, IJCAI.
[82] A. Shapiro. Monte Carlo Sampling Methods , 2003 .
[83] Sham M. Kakade,et al. On the sample complexity of reinforcement learning , 2003 .
[84] John Langford,et al. Exploration in Metric State Spaces , 2003, ICML.
[85] John N. Tsitsiklis,et al. The Sample Complexity of Exploration in the Multi-Armed Bandit Problem , 2004, J. Mach. Learn. Res..
[86] Dimitri P. Bertsekas,et al. Least Squares Policy Evaluation Algorithms with Linear Function Approximation , 2003, Discret. Event Dyn. Syst..
[87] Benjamin Van Roy,et al. The Linear Programming Approach to Approximate Dynamic Programming , 2003, Oper. Res..
[88] Michail G. Lagoudakis,et al. Least-Squares Policy Iteration , 2003, J. Mach. Learn. Res..
[89] Abhijit Gosavi,et al. Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning , 2003 .
[90] Peter Dayan,et al. Q-learning , 1992, Machine Learning.
[91] Peter Auer,et al. Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.
[92] Abhijit Gosavi,et al. Reinforcement learning for long-run average cost , 2004, Eur. J. Oper. Res..
[93] Peter Stone,et al. Policy gradient reinforcement learning for fast quadrupedal locomotion , 2004, IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04. 2004.
[94] Steven J. Bradtke,et al. Linear Least-Squares algorithms for temporal difference learning , 2004, Machine Learning.
[95] Naoki Abe,et al. Cross channel optimized marketing by reinforcement learning , 2004, KDD.
[96] John N. Tsitsiklis,et al. Asynchronous Stochastic Approximation and Q-Learning , 1994, Machine Learning.
[97] Yishay Mansour,et al. Experts in a Markov Decision Process , 2004, NIPS.
[98] Tommi S. Jaakkola,et al. Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms , 2000, Machine Learning.
[99] Peter Dayan,et al. Technical Note: Q-Learning , 2004, Machine Learning.
[100] Satinder Singh,et al. An upper bound on the loss from approximate optimal-value functions , 1994, Machine Learning.
[101] Benjamin Van Roy,et al. On Constraint Sampling in the Linear Programming Approach to Approximate Dynamic Programming , 2004, Math. Oper. Res..
[102] Long Ji Lin,et al. Self-improving reactive agents based on reinforcement learning, planning and teaching , 1992, Machine Learning.
[103] V. B. Tadic,et al. On the almost sure rate of convergence of linear stochastic approximation algorithms , 2004, IEEE Transactions on Information Theory.
[104] John N. Tsitsiklis,et al. Feature-based methods for large scale dynamic programming , 2004, Machine Learning.
[105] Justin A. Boyan,et al. Technical Update: Least-Squares Temporal Difference Learning , 2002, Machine Learning.
[106] Michael Kearns,et al. Near-Optimal Reinforcement Learning in Polynomial Time , 1998, Machine Learning.
[107] A. Barto,et al. Improved Temporal Difference Methods with Linear Function Approximation , 2004 .
[108] William D. Smart,et al. Interpolation-based Q-learning , 2004, ICML.
[109] Amos Storkey,et al. Advances in Neural Information Processing Systems 20 , 2007 .
[110] Marcus Hutter. Simulation Algorithms for Computational Systems Biology , 2017, Texts in Theoretical Computer Science. An EATCS Series.
[111] Christopher K. I. Williams,et al. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning) , 2005 .
[112] Martin A. Riedmiller. Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method , 2005, ECML.
[113] Pierre Geurts,et al. Tree-Based Batch Mode Reinforcement Learning , 2005, J. Mach. Learn. Res..
[114] Michael L. Littman,et al. A theoretical analysis of Model-Based Interval Estimation , 2005, ICML.
[115] Shie Mannor,et al. Basis Function Adaptation in Temporal Difference Reinforcement Learning , 2005, Ann. Oper. Res..
[116] Richard S. Sutton,et al. Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.
[117] Richard S. Sutton,et al. Learning to predict by the methods of temporal differences , 1988, Machine Learning.
[118] Shie Mannor,et al. Reinforcement learning with Gaussian processes , 2005, ICML.
[119] Stefan Schaal,et al. Natural Actor-Critic , 2003, Neurocomputing.
[120] Pieter Abbeel,et al. An Application of Reinforcement Learning to Aerobatic Helicopter Flight , 2006, NIPS.
[121] Lihong Li,et al. PAC model-free reinforcement learning , 2006, ICML.
[122] Prasad Tadepalli,et al. Scaling Model-Based Average-Reward Reinforcement Learning for Product Delivery , 2006, ECML.
[123] Andrew G. Barto,et al. An intrinsic reward mechanism for efficient exploration , 2006, ICML.
[124] Liming Xiang,et al. Kernel-Based Reinforcement Learning , 2006, ICIC.
[125] Jesse Hoey,et al. An analytic solution to discrete Bayesian reinforcement learning , 2006, ICML.
[126] Shie Mannor,et al. Automatic basis function construction for approximate dynamic programming and reinforcement learning , 2006, ICML.
[127] Benjamin Van Roy,et al. A Cost-Shaping Linear Program for Average-Cost Approximate Dynamic Programming with Performance Guarantees , 2006, Math. Oper. Res..
[128] Alborz Geramifard,et al. iLSTD: Eligibility Traces and Convergence Analysis , 2006, NIPS.
[129] Johannes Fürnkranz,et al. Proceedings of the 17th European conference on Machine Learning , 2006 .
[130] R. Sutton. Gain Adaptation Beats Least Squares , 2006 .
[131] Warren B. Powell,et al. Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming , 2006, Machine Learning.
[132] Benjamin Van Roy. Performance Loss Bounds for Approximate Value Iteration with State Aggregation , 2006, Math. Oper. Res..
[133] P. Glynn,et al. Opportunities and challenges in using online preference data for vehicle pricing: A case study at General Motors , 2006 .
[134] Csaba Szepesvári,et al. Bandit Based Monte-Carlo Planning , 2006, ECML.
[135] Xi-Ren Cao,et al. Stochastic learning and optimization - A sensitivity-based approach , 2007, Annu. Rev. Control..
[136] Xin Xu,et al. Kernel-Based Least Squares Policy Iteration for Reinforcement Learning , 2007, IEEE Transactions on Neural Networks.
[137] Csaba Szepesvári,et al. Fitted Q-iteration in continuous action-space MDPs , 2007, NIPS.
[138] Tao Wang,et al. Stable Dual Dynamic Programming , 2007, NIPS.
[139] Léon Bottou,et al. The Tradeoffs of Large Scale Learning , 2007, NIPS.
[140] Dimitri P. Bertsekas,et al. Stochastic optimal control : the discrete time case , 2007 .
[141] Mohammad Ghavamzadeh,et al. Bayesian actor-critic algorithms , 2007, ICML '07.
[142] Peter Stone,et al. Model-Based Exploration in Continuous State Spaces , 2007, SARA.
[143] Michael L. Littman,et al. Online Linear Regression and Its Application to Model-Based Reinforcement Learning , 2007, NIPS.
[144] Csaba Szepesvári,et al. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path , 2006, Machine Learning.
[145] Michael C. Fu,et al. An Asymptotically Efficient Simulation-Based Algorithm for Finite Horizon Stochastic Dynamic Programming , 2007, IEEE Transactions on Automatic Control.
[146] H. Robbins. Some aspects of the sequential design of experiments , 1952 .
[147] Warren B. Powell,et al. Approximate Dynamic Programming - Solving the Curses of Dimensionality , 2007 .
[148] Richard S. Sutton,et al. Reinforcement Learning of Local Shape in the Game of Go , 2007, IJCAI.
[149] D. Bertsekas,et al. Q-learning algorithms for optimal stopping based on least squares , 2007, 2007 European Control Conference (ECC).
[150] Lihong Li,et al. Analyzing feature generation for value-function approximation , 2007, ICML '07.
[151] Zoubin Ghahramani,et al. Proceedings of the 24th international conference on Machine learning , 2007, ICML 2007.
[152] Sanjoy Dasgupta,et al. Random projection trees and low dimensional manifolds , 2008, STOC.
[153] Joelle Pineau,et al. Online Planning Algorithms for POMDPs , 2008, J. Artif. Intell. Res..
[154] Peter Auer,et al. Near-optimal Regret Bounds for Reinforcement Learning , 2008, J. Mach. Learn. Res..
[155] András Lörincz,et al. The many faces of optimism: a unifying approach , 2008, ICML '08.
[156] Richard S. Sutton,et al. A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation , 2008, NIPS.
[157] Lihong Li,et al. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning , 2008, ICML '08.
[158] Sean P. Meyn,et al. An analysis of reinforcement learning with function approximation , 2008, ICML '08.
[159] Joelle Pineau,et al. Model-Based Bayesian Reinforcement Learning in Large Structured Domains , 2008, UAI.
[160] Csaba Szepesvári,et al. Finite-Time Bounds for Fitted Value Iteration , 2008, J. Mach. Learn. Res..
[161] Csaba Szepesvári,et al. Empirical Bernstein stopping , 2008, ICML '08.
[162] L. Sherry,et al. Estimating Taxi-out times with a reinforcement learning algorithm , 2008, 2008 IEEE/AIAA 27th Digital Avionics Systems Conference.
[163] M. Kosorok. Introduction to Empirical Processes and Semiparametric Inference , 2008 .
[164] Dimitri P. Bertsekas,et al. New error bounds for approximations from projected linear equations , 2008, 2008 46th Annual Allerton Conference on Communication, Control, and Computing.
[165] William W. Cohen,et al. Proceedings of the 23rd international conference on Machine learning , 2006, ICML 2008.
[166] V. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint , 2008, Texts and Readings in Mathematics.
[167] Shie Mannor,et al. Reinforcement learning in the presence of rare events , 2008, ICML '08.
[168] Marc Toussaint,et al. Hierarchical POMDP Controller Optimization by Likelihood Maximization , 2008, UAI.
[169] Michael L. Littman,et al. Multi-resolution Exploration in Continuous Spaces , 2008, NIPS.
[170] Alborz Geramifard,et al. Dyna-Style Planning with Linear Function Approximation and Prioritized Sweeping , 2008, UAI.
[171] Shie Mannor,et al. Regularized Policy Iteration , 2008, NIPS.
[172] Panos M. Pardalos,et al. Approximate dynamic programming: solving the curses of dimensionality , 2009, Optim. Methods Softw..
[173] Shalabh Bhatnagar,et al. Natural actor-critic algorithms , 2009, Autom..
[174] Gavin Taylor,et al. Kernelized value function approximation for reinforcement learning , 2009, ICML '09.
[175] Alexandre B. Tsybakov,et al. Introduction to Nonparametric Estimation , 2008, Springer series in statistics.
[176] Warren B. Powell,et al. An Optimal Approximate Dynamic Programming Algorithm for the Lagged Asset Acquisition Problem , 2009, Math. Oper. Res..
[177] Sridhar Mahadevan,et al. Learning Representation and Control in Markov Decision Processes: New Frontiers , 2009, Found. Trends Mach. Learn..
[178] Andrew Y. Ng,et al. Regularization and feature selection in least-squares temporal difference learning , 2009, ICML '09.
[179] Shie Mannor,et al. Regularized Fitted Q-iteration: Application to Planning , 2008, EWRL.
[180] Dale Schuurmans,et al. Learning Exercise Policies for American Options , 2009, AISTATS.
[181] Ambuj Tewari,et al. REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs , 2009, UAI.
[182] Brian Tanner,et al. RL-Glue: Language-Independent Software for Reinforcement-Learning Experiments , 2009, J. Mach. Learn. Res..
[183] Csaba Szepesvári,et al. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits , 2009, Theor. Comput. Sci..
[184] Shie Mannor,et al. Markov Decision Processes with Arbitrary Reward Processes , 2009, Math. Oper. Res..
[185] Shalabh Bhatnagar,et al. Fast gradient-descent methods for temporal-difference learning with linear function approximation , 2009, ICML '09.
[186] Warren B. Powell,et al. An Approximate Dynamic Programming Algorithm for Large-Scale Fleet Management: A Case Application , 2009, Transp. Sci..
[187] C. Lemieux. Monte Carlo and Quasi-Monte Carlo Sampling , 2009 .
[188] Shalabh Bhatnagar,et al. Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation , 2009, NIPS.
[189] Bruno Scherrer,et al. Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view , 2010, ICML.
[190] Shalabh Bhatnagar,et al. Toward Off-Policy Learning Control with Function Approximation , 2010, ICML.
[191] Csaba Szepesvári,et al. Model-based reinforcement learning with nearly tight exploration complexity bounds , 2010, ICML.
[192] Richard S. Sutton,et al. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces , 2010, Artificial General Intelligence.
[193] Ronald Ortner,et al. Online Regret Bounds for Markov Decision Processes with Deterministic Transitions , 2008, ALT.
[194] Csaba Szepesvári,et al. The Online Loop-free Stochastic Shortest-Path Problem , 2010, Annual Conference on Computational Learning Theory.
[195] Bart De Schutter,et al. Reinforcement Learning and Dynamic Programming Using Function Approximators , 2010 .
[196] R. Sutton,et al. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces , 2010 .
[197] Peter A. Flach,et al. Proceedings of the 28th International Conference on Machine Learning , 2011 .
[198] Kevin D. Glazebrook,et al. Multi-Armed Bandit Allocation Indices: Gittins/Multi-Armed Bandit Allocation Indices , 2011 .
[199] Ferenc Beleznay,et al. Comparing Value-Function Estimation Algorithms in Undiscounted Problems , 2012 .
[200] T. L. Lai,et al. Asymptotically Efficient Adaptive Allocation Rules , 1985, Adv. Appl. Math..