Reinforcement Learning in Continuous State and Action Spaces

Many traditional reinforcement-learning algorithms have been designed for problems with small, finite state and action spaces. Learning in such discrete problems can be difficult, due to noise and delayed reinforcement. However, many real-world problems have continuous state or action spaces, which can make learning a good decision policy even more challenging. In this chapter we discuss how to automatically find good decision policies in continuous domains. Because analytically computing a good policy from a continuous model can be infeasible, we mainly focus on methods that explicitly update a representation of a value function, a policy, or both. We discuss considerations in choosing appropriate representations for these functions, as well as gradient-based and gradient-free ways to update their parameters. We show how to apply these methods to reinforcement-learning problems and discuss many specific algorithms, including gradient-based temporal-difference learning, evolutionary strategies, policy-gradient algorithms, and (natural) actor-critic methods. We discuss the advantages of the different approaches and empirically compare the performance of a state-of-the-art actor-critic method and a state-of-the-art evolutionary strategy.
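To make the value-function approach concrete, the following is a minimal sketch of semi-gradient TD(0) with linear function approximation on a one-dimensional continuous state space. The toy dynamics, the Gaussian radial-basis features, and all constants are illustrative assumptions for this sketch, not details taken from the chapter.

    import numpy as np

    rng = np.random.default_rng(0)

    centers = np.linspace(0.0, 1.0, 10)   # RBF centres spread over the state space
    width = 0.1                           # shared RBF width

    def features(s):
        # Gaussian radial-basis features phi(s) for a scalar state s
        return np.exp(-0.5 * ((s - centers) / width) ** 2)

    def step(s):
        # Toy dynamics: noisy drift toward 1.0; reward 1 on reaching the goal
        s_next = min(s + 0.05 + 0.02 * rng.standard_normal(), 1.0)
        done = s_next >= 1.0
        reward = 1.0 if done else 0.0
        return s_next, reward, done

    theta = np.zeros(len(centers))  # weights of the linear value function
    alpha, gamma = 0.1, 0.99        # step size and discount factor

    for episode in range(500):
        s, done = 0.0, False
        while not done:
            s_next, r, done = step(s)
            v = features(s) @ theta
            v_next = 0.0 if done else features(s_next) @ theta
            delta = r + gamma * v_next - v          # TD error
            theta += alpha * delta * features(s)    # semi-gradient update
            s = s_next

    print("Estimated V(0.5):", features(0.5) @ theta)

The same structure carries over to the other methods discussed in the chapter: policy-gradient and actor-critic algorithms replace or augment the value-function parameters with policy parameters updated along a (natural) gradient estimate, while gradient-free methods such as evolutionary strategies search the parameter space directly.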
