Insights in reinforcement learning: formal analysis and empirical evaluation of temporal-difference learning algorithms

A key aspect of artificial intelligence is the ability to learn from experience. If examples of correct solutions exist, supervised learning techniques can be used to predict the correct solution for future observations. However, such examples are often not readily available. The field of reinforcement learning investigates methods that can learn from experience when no examples of correct behavior are given, but a reinforcement signal is supplied to the learning entity. Many problems fit this description. In games, the reinforcement signal might be whether or not the game was won. In economic settings, the reinforcement can represent the profit or loss that is eventually made. In robotics, it is often easier to specify how well the robot is doing than to find examples of good behavior beforehand. An advantage of reinforcement learning is that the designer of the system need not know what good solutions to a problem may be; rather, the system finds good solutions by trial and error.

Of particular interest to us are model-free temporal-difference algorithms. These algorithms do not use experiences to build an explicit model of the environment, but instead construct an approximation of the expected value of each possible action. These values can then be used to construct solutions. Such methods are computationally efficient, easy to implement, and often find solutions quickly. Additionally, in many settings it is easier to find a good policy for selecting actions than to model the whole environment and then use that model to determine what to do.

In this dissertation, we analyze several existing model-free temporal-difference algorithms. We discuss some problems with these approaches, such as a potentially large overestimation of the action values by the popular Q-learning algorithm. We discuss ways to prevent these issues and propose a number of new algorithms. We analyze the new algorithms and compare their performance on a number of tasks. We conclude that which algorithm performs best depends strongly on the characteristics of the problem, and we give some indications of which algorithms are to be preferred in different problem settings. To solve problems with unknown characteristics, we propose using ensemble methods that combine the action-selection policies of a number of different learning entities. We discuss several approaches to combining these policies and demonstrate empirically that good solutions can reliably be found. Additionally, we extend model-free temporal-difference algorithms to problems with continuous action spaces. In such problems, conventional approaches are not applicable because they cannot handle the infinite number of possible actions. We propose a new algorithm that is explicitly designed for continuous spaces and show that it compares favorably to the current state of the art.
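To make the overestimation issue mentioned above concrete, the sketch below contrasts the standard tabular Q-learning update, which bootstraps from the maximum over its own noisy estimates, with a double-estimator (Double Q-learning style) update in which one value table selects the next action and the other evaluates it. This is a minimal illustration under assumed tabular state-action arrays and placeholder learning-rate and discount settings; it is not code taken from the dissertation.

    # Illustrative sketch only: tabular Q-learning vs. a double-estimator update.
    # Q, QA, QB are assumed to be NumPy arrays of shape (num_states, num_actions);
    # alpha and gamma are placeholder hyperparameters.
    import numpy as np

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
        # Standard Q-learning bootstraps from max_a' Q(s_next, a').
        # The same noisy estimates both select and evaluate the next action, and
        # E[max_a' Q(s_next, a')] >= max_a' E[Q(s_next, a')], so the target tends
        # to be biased upward, which can cause large overestimation of values.
        target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])

    def double_q_update(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.95):
        # Double estimator: one table picks the greedy action, the other
        # evaluates it, which removes the positive bias of the single-estimator
        # maximum at the cost of maintaining two sets of estimates.
        if np.random.rand() < 0.5:
            a_star = np.argmax(QA[s_next])           # select with QA
            target = r + gamma * QB[s_next, a_star]  # evaluate with QB
            QA[s, a] += alpha * (target - QA[s, a])
        else:
            a_star = np.argmax(QB[s_next])           # select with QB
            target = r + gamma * QA[s_next, a_star]  # evaluate with QA
            QB[s, a] += alpha * (target - QB[s, a])

Because the expected maximum of several noisy estimates is at least the maximum of their expectations, the single-estimator target is biased upward whenever the value estimates are uncertain; decoupling action selection from action evaluation is one way to avoid this bias.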
