Richard S. Sutton, et al. Neuronlike adaptive elements that can solve difficult learning control problems, 1983, IEEE Transactions on Systems, Man, and Cybernetics.
Richard S. Sutton, et al. Temporal credit assignment in reinforcement learning, 1984.
Learning from delayed rewards, 1989.
Stephen José Hanson, et al. Advances in Neural Information Processing Systems, 1990, NIPS 1990.
Peter W. Glynn, et al. Likelihood ratio gradient estimation for stochastic systems, 1990, CACM.
Boris Polyak, et al. Acceleration of stochastic approximation by averaging, 1992.
Pavel Brazdil, et al. Proceedings of the European Conference on Machine Learning, 1993.
Michael L. Littman, et al. Packet Routing in Dynamically Changing Networks: A Reinforcement Learning Approach, 1993, NIPS.
Andrew W. Moore, et al. Generalization in Reinforcement Learning: Safely Approximating the Value Function, 1994, NIPS.
Gerald Tesauro, et al. TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play, 1994, Neural Computation.
Michael I. Jordan, et al. On the Convergence of Stochastic Iterative Dynamic Programming Algorithms, 1993, Neural Computation.
Michael L. Littman, et al. Markov Games as a Framework for Multi-Agent Reinforcement Learning, 1994, ICML.
Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994, Wiley Series in Probability and Statistics.
Michael I. Jordan, et al. Reinforcement Learning with Soft State Aggregation, 1994, NIPS.
Geoffrey J. Gordon. Stable Function Approximation in Dynamic Programming, 1995, ICML.
Andrew G. Barto, et al. Improving Elevator Performance Using Reinforcement Learning, 1995, NIPS.
Steven J. Bradtke, et al. Incremental dynamic programming for on-line adaptive optimal control, 1995.
Wei Zhang, et al. A Reinforcement Learning Approach to Job-Shop Scheduling, 1995, IJCAI.
Richard S. Sutton, et al. Reinforcement Learning with Replacing Eligibility Traces, 1996, Machine Learning.
John N. Tsitsiklis, et al. Analysis of Temporal-Difference Learning with Function Approximation, 1996, NIPS.
Dimitri P. Bertsekas, et al. Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems, 1996, NIPS.
Csaba Szepesvári, et al. Learning and Exploitation Do Not Conflict Under Minimax Optimality, 1997, ECML.
Thomas G. Dietterich. The MAXQ Method for Hierarchical Reinforcement Learning, 1998, ICML.
Doina Precup, et al. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, 1999, Artif. Intell.
John N. Tsitsiklis, et al. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives, 1999, IEEE Trans. Autom. Control.
Yishay Mansour, et al. Approximate Planning in Large POMDPs via Reusable Trajectories, 1999, NIPS.
Csaba Szepesvári, et al. A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms, 1999, Neural Computation.
Yishay Mansour, et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.
Michael I. Jordan, et al. PEGASUS: A policy search method for large MDPs and POMDPs, 2000, UAI.
John N. Tsitsiklis, et al. Regression methods for pricing complex American-style options, 2001, IEEE Trans. Neural Networks.
Csaba Szepesvári, et al. Efficient approximate planning in continuous space Markovian Decision Problems, 2001, AI Commun.
Shie Mannor, et al. PAC Bounds for Multi-armed Bandit and Markov Decision Processes, 2002, COLT.
John Langford, et al. Approximately Optimal Approximate Reinforcement Learning, 2002, ICML.
Koby Crammer, et al. Advances in Neural Information Processing Systems 14, 2002.
H. He, et al. Efficient Reinforcement Learning Using Recursive Least-Squares Methods, 2002, J. Artif. Intell. Res.
Ronen I. Brafman, et al. R-MAX: A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, 2001, J. Mach. Learn. Res.
Adam Krzyzak, et al. A Distribution-Free Theory of Nonparametric Regression, 2002, Springer Series in Statistics.
Sean P. Meyn, et al. Risk-Sensitive Optimal Control for Markov Decision Processes with Monotone Cost, 2002, Math. Oper. Res.
G. Wahba. Reproducing Kernel Hilbert Spaces: Two Brief Reviews, 2003.
Sham M. Kakade, et al. On the sample complexity of reinforcement learning, 2003.
John N. Tsitsiklis, et al. The Sample Complexity of Exploration in the Multi-Armed Bandit Problem, 2004, J. Mach. Learn. Res.
Dimitri P. Bertsekas, et al. Least Squares Policy Evaluation Algorithms with Linear Function Approximation, 2003, Discret. Event Dyn. Syst.
Benjamin Van Roy, et al. The Linear Programming Approach to Approximate Dynamic Programming, 2003, Oper. Res.
Peter Auer, et al. Finite-time Analysis of the Multiarmed Bandit Problem, 2002, Machine Learning.
Peter Stone, et al. Policy gradient reinforcement learning for fast quadrupedal locomotion, 2004, IEEE International Conference on Robotics and Automation (ICRA '04).
A. Barto, et al. Linear Least-Squares Algorithms for Temporal Difference Learning, 1996, Machine Learning.
Naoki Abe, et al. Cross channel optimized marketing by reinforcement learning, 2004, KDD '04.
John N. Tsitsiklis, et al. Asynchronous Stochastic Approximation and Q-Learning, 1994, Machine Learning.
Tommi S. Jaakkola, et al. Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms, 2000, Machine Learning.
Satinder Singh, et al. An upper bound on the loss from approximate optimal-value functions, 1994, Machine Learning.
Benjamin Van Roy, et al. On Constraint Sampling in the Linear Programming Approach to Approximate Dynamic Programming, 2004, Math. Oper. Res.
Long Ji Lin, et al. Self-improving reactive agents based on reinforcement learning, planning and teaching, 1992, Machine Learning.
V. B. Tadic. On the almost sure rate of convergence of linear stochastic approximation algorithms, 2004, IEEE Transactions on Information Theory.
John N. Tsitsiklis, et al. Feature-based methods for large scale dynamic programming, 1996, Machine Learning.
Justin A. Boyan, et al. Technical Update: Least-Squares Temporal Difference Learning, 2002, Machine Learning.
Michael Kearns, et al. Near-Optimal Reinforcement Learning in Polynomial Time, 1998, Machine Learning.
A. Barto, et al. Improved Temporal Difference Methods with Linear Function Approximation, 2004.
Amos Storkey, et al. Advances in Neural Information Processing Systems 16, 2004, NIPS 2004.
Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability, 2005, Texts in Theoretical Computer Science. An EATCS Series.
Michael L. Littman, et al. A theoretical analysis of Model-Based Interval Estimation, 2005, ICML '05.
Shie Mannor, et al. Basis Function Adaptation in Temporal Difference Reinforcement Learning, 2005, Ann. Oper. Res.
Richard S. Sutton, et al. Learning to Predict by the Methods of Temporal Differences, 1988, Machine Learning.
Pieter Abbeel, et al. An Application of Reinforcement Learning to Aerobatic Helicopter Flight, 2006, NIPS.
Prasad Tadepalli, et al. Scaling Model-Based Average-Reward Reinforcement Learning for Product Delivery, 2006, ECML.
Andrew G. Barto, et al. An intrinsic reward mechanism for efficient exploration, 2006, ICML.
Jesse Hoey, et al. An analytic solution to discrete Bayesian reinforcement learning, 2006, ICML.
Shie Mannor, et al. Automatic basis function construction for approximate dynamic programming and reinforcement learning, 2006, ICML.
Benjamin Van Roy, et al. A Cost-Shaping Linear Program for Average-Cost Approximate Dynamic Programming with Performance Guarantees, 2006, Math. Oper. Res.
Alborz Geramifard, et al. iLSTD: Eligibility Traces and Convergence Analysis, 2006, NIPS.
Johannes Fürnkranz, et al. Proceedings of the 17th European Conference on Machine Learning, 2006.
Warren B. Powell, et al. Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming, 2006, Machine Learning.
Benjamin Van Roy. Performance Loss Bounds for Approximate Value Iteration with State Aggregation, 2006, Math. Oper. Res.
Xi-Ren Cao, et al. Stochastic Learning and Optimization: A Sensitivity-Based Approach, 2007.
Xin Xu, et al. Kernel-Based Least Squares Policy Iteration for Reinforcement Learning, 2007, IEEE Transactions on Neural Networks.
Csaba Szepesvári, et al. Fitted Q-iteration in continuous action-space MDPs, 2007, NIPS.
Dimitri P. Bertsekas, et al. Stochastic optimal control: the discrete time case, 2007.
Peter Stone, et al. Model-Based Exploration in Continuous State Spaces, 2007, SARA.
Michael L. Littman, et al. Online Linear Regression and Its Application to Model-Based Reinforcement Learning, 2007, NIPS.
Csaba Szepesvári, et al. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path, 2007, Machine Learning.
Michael C. Fu, et al. An Asymptotically Efficient Simulation-Based Algorithm for Finite Horizon Stochastic Dynamic Programming, 2007, IEEE Transactions on Automatic Control.
H. Robbins. Some Aspects of the Sequential Design of Experiments, 2007.
Warren B. Powell, et al. Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2007.
Richard S. Sutton, et al. Reinforcement Learning of Local Shape in the Game of Go, 2007, IJCAI.
D. Bertsekas, et al. Q-learning algorithms for optimal stopping based on least squares, 2007, 2007 European Control Conference (ECC).
Lihong Li, et al. Analyzing feature generation for value-function approximation, 2007, ICML '07.
Zoubin Ghahramani, et al. Proceedings of the 24th International Conference on Machine Learning, 2007, ICML 2007.
Sanjoy Dasgupta, et al. Random projection trees and low dimensional manifolds, 2008, STOC.
Peter Auer, et al. Near-optimal Regret Bounds for Reinforcement Learning, 2008, J. Mach. Learn. Res.
Lihong Li, et al. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning, 2008, ICML '08.
Sean P. Meyn, et al. An analysis of reinforcement learning with function approximation, 2008, ICML '08.
Joelle Pineau, et al. Model-Based Bayesian Reinforcement Learning in Large Structured Domains, 2008, UAI.
L. Sherry, et al. Estimating Taxi-out times with a reinforcement learning algorithm, 2008, 2008 IEEE/AIAA 27th Digital Avionics Systems Conference.
Dimitri P. Bertsekas, et al. New error bounds for approximations from projected linear equations, 2008, 2008 46th Annual Allerton Conference on Communication, Control, and Computing.
William W. Cohen, et al. Proceedings of the 23rd International Conference on Machine Learning, 2006, ICML 2006.
V. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint, 2008.
Shie Mannor, et al. Reinforcement learning in the presence of rare events, 2008, ICML '08.
Marc Toussaint, et al. Hierarchical POMDP Controller Optimization by Likelihood Maximization, 2008, UAI.
Alborz Geramifard, et al. Dyna-Style Planning with Linear Function Approximation and Prioritized Sweeping, 2008, UAI.
Panos M. Pardalos, et al. Approximate dynamic programming: solving the curses of dimensionality, 2009, Optim. Methods Softw.
Gavin Taylor, et al. Kernelized value function approximation for reinforcement learning, 2009, ICML '09.
Warren B. Powell, et al. An Optimal Approximate Dynamic Programming Algorithm for the Lagged Asset Acquisition Problem, 2009, Math. Oper. Res.
Sridhar Mahadevan, et al. Learning Representation and Control in Markov Decision Processes: New Frontiers, 2009, Found. Trends Mach. Learn.
Andrew Y. Ng, et al. Regularization and feature selection in least-squares temporal difference learning, 2009, ICML '09.
Shie Mannor, et al. Regularized Fitted Q-iteration: Application to Planning, 2009, EWRL.
Ambuj Tewari, et al. REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs, 2009, UAI.
Brian Tanner, et al. RL-Glue: Language-Independent Software for Reinforcement-Learning Experiments, 2009, J. Mach. Learn. Res.
Csaba Szepesvári, et al. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits, 2009, Theor. Comput. Sci.
Shie Mannor, et al. Markov Decision Processes with Arbitrary Reward Processes, 2009, Math. Oper. Res.
Shalabh Bhatnagar, et al. Fast gradient-descent methods for temporal-difference learning with linear function approximation, 2009, ICML '09.
Warren B. Powell, et al. An Approximate Dynamic Programming Algorithm for Large-Scale Fleet Management: A Case Application, 2009, Transp. Sci.
Monte Carlo and Quasi-Monte Carlo Sampling, 2009.
Shalabh Bhatnagar, et al. Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation, 2009, NIPS.
Shalabh Bhatnagar, et al. Toward Off-Policy Learning Control with Function Approximation, 2010, ICML.
Csaba Szepesvári, et al. Model-based reinforcement learning with nearly tight exploration complexity bounds, 2010, ICML.
Richard S. Sutton, et al. GQ(lambda): A general gradient algorithm for temporal-difference prediction learning with eligibility traces, 2010, AGI 2010.
Ronald Ortner. Online regret bounds for Markov decision processes with deterministic transitions, 2010, Theor. Comput. Sci.
The Online Loop-free Stochastic Shortest-Path Problem, 2010, COLT.
Bart De Schutter, et al. Reinforcement Learning and Dynamic Programming Using Function Approximators, 2010.
Peter A. Flach, et al. Proceedings of the 28th International Conference on Machine Learning, 2011.
Ferenc Beleznay, et al. Comparing Value-Function Estimation Algorithms in Undiscounted Problems, 2012.