Reinforcement Learning: An Introduction

Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.


[301]  Kenji Doya,et al.  Temporal Difference Learning in Continuous Time and Space , 1995, NIPS.

[302]  Richard S. Sutton,et al.  A Summary Comparison of CMAC Neural Network and Traditional Adaptive Control Systems , 1995 .

[303]  Peter Norvig,et al.  Artificial Intelligence: A Modern Approach , 1995 .

[304]  Richard S. Sutton,et al.  TD Models: Modeling the World at a Mixture of Time Scales , 1995, ICML.

[305]  Thomas Dean,et al.  Decomposition Techniques for Planning in Stochastic Domains , 1995, IJCAI.

[306]  Pawel Cichosz,et al.  Truncating Temporal Differences: On the Efficient Implementation of TD(lambda) for Reinforcement Learning , 1994, J. Artif. Intell. Res..

[307]  Stuart J. Russell,et al.  Approximating Optimal Policies for Partially Observable Stochastic Domains , 1995, IJCAI.

[308]  Leslie Pack Kaelbling,et al.  On the Complexity of Solving Markov Decision Problems , 1995, UAI.

[309]  J. Pearl Causal diagrams for empirical research , 1995 .

[310]  Mandayam A. L. Thathachar,et al.  Local and Global Optimization Algorithms for Generalized Learning Automata , 1995, Neural Computation.

[311]  R. Agrawal Sample mean based index policies by O(log n) regret for the multi-armed bandit problem , 1995, Advances in Applied Probability.

[312]  Yoshua Bengio,et al.  Pattern Recognition and Neural Networks , 1995 .

[313]  A. Barto Adaptive Critics and the Basal Ganglia , 1995 .

[314]  Michael O. Duff,et al.  Q-Learning for Bandit Problems , 1995, ICML.

[315]  Jonathan Baxter,et al.  Learning internal representations , 1995, COLT '95.

[316]  Thomas G. Dietterich,et al.  High-Performance Job-Shop Scheduling With A Time-Delay TD(λ) Network , 1995, NIPS 1995.

[317]  Leslie Pack Kaelbling,et al.  Learning Policies for Partially Observable Environments: Scaling Up , 1997, ICML.

[318]  Andrew G. Barto,et al.  Improving Elevator Performance Using Reinforcement Learning , 1995, NIPS.

[319]  Steven J. Bradtke,et al.  Incremental dynamic programming for on-line adaptive optimal control , 1995 .

[320]  Gavin Adrian Rummery Problem solving with reinforcement learning , 1995 .

[321]  Wei Zhang,et al.  A Reinforcement Learning Approach to job-shop Scheduling , 1995, IJCAI.

[322]  Learning and memory in the honeybee. , 1995, The Journal of neuroscience : the official journal of the Society for Neuroscience.

[323]  Richard S. Sutton,et al.  Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding , 1995, NIPS.

[324]  Geoffrey J. Gordon Stable Fitted Reinforcement Learning , 1995, NIPS.

[325]  Jing Peng,et al.  Efficient Memory-Based Dynamic Programming , 1995, ICML.

[326]  Peter Dayan,et al.  Bee foraging in uncertain environments using predictive hebbian learning , 1995, Nature.

[327]  Craig Boutilier,et al.  Exploiting Structure in Policy Construction , 1995, IJCAI.

[328]  Truncating Temporal Differences: On the Efficient Implementation of TD(λ) for Reinforcement Learning , 1995 .

[329]  Andrew G. Barto,et al.  Learning to Act Using Real-Time Dynamic Programming , 1995, Artif. Intell..

[330]  Leemon C. Baird Residual Algorithms: Reinforcement Learning with Function Approximation , 1995, ICML.

[331]  A. Dickinson,et al.  Reward-related signals carried by dopamine neurons. , 1995 .

[332]  J. Wickens,et al.  Cellular models of reinforcement. , 1995 .

[333]  Andrew W. Moore,et al.  Reinforcement Learning: A Survey , 1996, J. Artif. Intell. Res..

[334]  P. Dayan,et al.  A framework for mesencephalic dopamine systems based on predictive Hebbian learning , 1996, The Journal of neuroscience : the official journal of the Society for Neuroscience.

[335]  Richard S. Sutton,et al.  Model-Based Reinforcement Learning with an Approximate, Learned Model , 1996 .

[336]  Gerald Tesauro,et al.  On-line Policy Improvement using Monte-Carlo Search , 1996, NIPS.

[337]  Richard S. Sutton,et al.  Reinforcement Learning with Replacing Eligibility Traces , 1996, Machine Learning.

[338]  Andrew McCallum,et al.  Reinforcement learning with selective perception and hidden state , 1996 .

[339]  A. Turing Intelligent Machinery, A Heretical Theory , 1996 .

[340]  John Rust Numerical dynamic programming in economics , 1996 .

[341]  J. A. Bryson Optimal control-1950 to 1985 , 1996 .

[342]  John N. Tsitsiklis,et al.  Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.

[343]  Wei Zhang,et al.  Reinforcement learning for job shop scheduling , 1996 .

[344]  Andrew G. Barto,et al.  Large-scale dynamic optimization using teams of reinforcement learning agents , 1996 .

[345]  W. Thomas Miller,et al.  UNH_CMAC Version 2.1 The University of New Hampshire Implementation of the Cerebellar Model Arithmetic Computer - CMAC , 1996 .

[346]  John N. Tsitsiklis,et al.  Analysis of Temporal-Difference Learning with Function Approximation , 1996, NIPS.

[347]  Dimitri P. Bertsekas,et al.  Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems , 1996, NIPS.

[348]  Prasad Tadepalli,et al.  Scaling Up Average Reward Reinforcement Learning by Approximating the Domain Models and the Value Function , 1996, ICML.

[349]  Peter Dayan,et al.  A Neural Substrate of Prediction and Reward , 1997, Science.

[350]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[351]  John N. Tsitsiklis,et al.  Rollout Algorithms for Combinatorial Optimization , 1997, J. Heuristics.

[352]  A. Machado Learning the temporal dynamics of behavior. , 1997, Psychological review.

[353]  H. Markram,et al.  Regulation of Synaptic Efficacy by Coincidence of Postsynaptic APs and EPSPs , 1997, Science.

[354]  David S. Touretzky,et al.  Shaping robot behavior using principles from instrumental conditioning , 1997, Robotics Auton. Syst..

[355]  M. Hammer The neural basis of associative reward learning in honeybees , 1997, Trends in Neurosciences.

[356]  Gary Boone,et al.  Minimum-time control of the Acrobot , 1997, Proceedings of International Conference on Robotics and Automation.

[357]  U. Frey,et al.  Synaptic tagging and long-term potentiation , 1997, Nature.

[358]  J.N. Tsitsiklis,et al.  A neuro-dynamic programming approach to retailer inventory management , 1997, Proceedings of the 36th IEEE Conference on Decision and Control.

[359]  J. Clouse On integrating apprentice learning and reinforcement learning , 1997 .

[360]  Xi-Ren Cao,et al.  Perturbation realization, potentials, and sensitivity analysis of Markov processes , 1997, IEEE Trans. Autom. Control..

[361]  Andrew W. Moore,et al.  Efficient Locally Weighted Polynomial Regression Predictions , 1997, ICML.

[362]  Milos Hauskrecht,et al.  Hierarchical Solution of Markov Decision Processes using Macro-actions , 1998, UAI.

[363]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[364]  Ronald E. Parr,et al.  Hierarchical control and learning for markov decision processes , 1998 .

[365]  W. Schultz,et al.  Learning of sequential movements by neural network model with dopamine-like reinforcement signal , 1998, Experimental Brain Research.

[366]  J. Hollerman,et al.  Dopamine neurons report an error in the temporal prediction of reward during learning , 1998, Nature Neuroscience.

[367]  T. Sejnowski,et al.  A Computational Model of Birdsong Learning by Auditory Experience and Auditory Feedback , 1998 .

[368]  Shun-ichi Amari,et al.  Natural Gradient Works Efficiently in Learning , 1998, Neural Computation.

[369]  R. Clark,et al.  Classical conditioning and brain systems: the role of awareness. , 1998, Science.

[370]  Preben Alstrøm,et al.  Learning to Drive a Bicycle Using Reinforcement Learning and Shaping , 1998, ICML.

[371]  Andrew W. Moore,et al.  Gradient Descent for General Reinforcement Learning , 1998, NIPS.

[372]  K. Berridge,et al.  What is the role of dopamine in reward: hedonic impact, reward learning, or incentive salience? , 1998, Brain Research Reviews.

[373]  Doina Precup,et al.  Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning , 1999, Artif. Intell..

[374]  L. Baird Reinforcement Learning Through Gradient Descent , 1999 .

[375]  W. Schultz,et al.  A neural network model with dopamine-like reinforcement signal that learns a spatial delayed response task , 1999, Neuroscience.

[376]  R. French Catastrophic forgetting in connectionist networks , 1999, Trends in Cognitive Sciences.

[377]  S. Grossberg,et al.  How the Basal Ganglia Use Parallel Excitatory and Inhibitory Learning Pathways to Selectively Respond to Unexpected Rewarding Cues , 1999, The Journal of Neuroscience.

[378]  John N. Tsitsiklis,et al.  Average cost temporal-difference learning , 1997, Proceedings of the 36th IEEE Conference on Decision and Control.

[379]  Justin A. Boyan,et al.  Least-Squares Temporal Difference Learning , 1999, ICML.

[380]  Geoffrey J. Gordon,et al.  Approximate solutions to markov decision processes , 1999 .

[381]  John N. Tsitsiklis,et al.  Actor-Critic Algorithms , 1999, NIPS.

[382]  Nicol N. Schraudolph Local Gain Adaptation in Stochastic Gradient Descent , 1999 .

[383]  Andrew Y. Ng,et al.  Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping , 1999, ICML.

[384]  Yishay Mansour,et al.  Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[385]  Simon Haykin,et al.  A dynamic channel assignment policy through Q-learning , 1999, IEEE Trans. Neural Networks.

[386]  C. Buhusi,et al.  Timing in simple conditioning and occasion setting: a neural network approach , 1999, Behavioural Processes.

[387]  Hector Magno,et al.  Models of Learning , 1999 .

[388]  Arthur L. Samuel,et al.  Some studies in machine learning using the game of checkers , 2000, IBM J. Res. Dev..

[389]  J. Donahoe,et al.  Behavior analysis and revaluation. , 2000, Journal of the experimental analysis of behavior.

[390]  Geoffrey J. Gordon Reinforcement Learning with Function Approximation Converges to a Region , 2000, NIPS.

[391]  Herbert Jaeger,et al.  Observable Operator Models for Discrete Stochastic Time Series , 2000, Neural Computation.

[392]  H. Kushner Numerical Methods for Stochastic Control Problems in Continuous Time , 2000 .

[393]  Doina Precup,et al.  Eligibility Traces for Off-Policy Policy Evaluation , 2000, ICML.

[394]  Doina Precup,et al.  Temporal abstraction in reinforcement learning , 2000, ICML 2000.

[395]  Andrew Y. Ng,et al.  Algorithms for Inverse Reinforcement Learning , 2000, ICML.

[396]  Ryan,et al.  Intrinsic and Extrinsic Motivations: Classic Definitions and New Directions. , 2000, Contemporary educational psychology.

[397]  Peter L. Bartlett,et al.  Infinite-Horizon Policy-Gradient Estimation , 2001, J. Artif. Intell. Res..

[398]  P. Montague,et al.  Predictability Modulates Human Brain Response to Reward , 2001, The Journal of Neuroscience.

[399]  Peter Redgrave,et al.  A computational model of action selection in the basal ganglia. II. Analysis and simulation of behaviour , 2001, Biological Cybernetics.

[400]  D. Kahneman,et al.  Functional Imaging of Neural Responses to Expectancy and Experience of Monetary Gains and Losses , 2001, Neuron.

[401]  Richard S. Sutton,et al.  Predictive Representations of State , 2001, NIPS.

[402]  Peter Dayan,et al.  Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems , 2001 .

[403]  John N. Tsitsiklis,et al.  Simulation-based optimization of Markov reward processes , 2001, IEEE Trans. Autom. Control..

[404]  Rajesh P. N. Rao,et al.  Spike-Timing-Dependent Hebbian Plasticity as Temporal Difference Learning , 2001, Neural Computation.

[405]  M. Arbib,et al.  Modeling functions of striatal dopamine modulation in learning and planning , 2001, Neuroscience.

[406]  Peter L. Bartlett,et al.  Experiments with Infinite-Horizon, Policy-Gradient Estimation , 2001, J. Artif. Intell. Res..

[407]  Sham M. Kakade,et al.  A Natural Policy Gradient , 2001, NIPS.

[408]  Richard S. Sutton,et al.  Comparing Policy-Gradient Algorithms , 2001 .

[409]  Xin Wang,et al.  Batch Value Function Approximation via Support Vectors , 2001, NIPS.

[410]  Sanjoy Dasgupta,et al.  Off-Policy Temporal Difference Learning with Function Approximation , 2001, ICML.

[411]  Christian R. Shelton,et al.  Importance sampling for reinforcement learning with multiple objectives , 2001 .

[412]  Tim Hesterberg,et al.  Monte Carlo Strategies in Scientific Computing , 2002, Technometrics.

[413]  Martin Müller,et al.  Computer Go , 2002, Artif. Intell..

[414]  John N. Tsitsiklis,et al.  On the Convergence of Optimistic Policy Iteration , 2003, J. Mach. Learn. Res..

[415]  Gerald Tesauro,et al.  Programming backgammon using self-teaching neural nets , 2002, Artif. Intell..

[416]  Eytan Ruppin,et al.  Actor-critic models of the basal ganglia: new anatomical and computational perspectives , 2002, Neural Networks.

[417]  David S. Touretzky,et al.  Timing and Partial Observability in the Dopamine System , 2002, NIPS.

[418]  M. Thathachar,et al.  Varieties of learning automata: an overview , 2002, IEEE Trans. Syst. Man Cybern. Part B.

[419]  P. Montague,et al.  Activity in human ventral striatum locked to errors of reward prediction , 2002, Nature Neuroscience.

[420]  John N. J. Reynolds,et al.  Dopamine-dependent plasticity of corticostriatal synapses , 2002, Neural Networks.

[421]  Ronen I. Brafman,et al.  R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning , 2001, J. Mach. Learn. Res..

[422]  P. Dayan Matters temporal , 2002, Trends in Cognitive Sciences.

[423]  Doina Precup,et al.  A Convergent Form of Approximate Policy Iteration , 2002, NIPS.

[424]  Theodore J. Perkins,et al.  On the Existence of Fixed Points for Q-Learning and Sarsa in Partially Observable Domains , 2002, ICML.

[425]  Nicol N. Schraudolph,et al.  Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent , 2002, Neural Computation.

[426]  Colin Camerer Behavioral Game Theory: Experiments in Strategic Interaction , 2003 .

[427]  Andrew Y. Ng,et al.  Shaping and policy search in reinforcement learning , 2003 .

[428]  Sridhar Mahadevan,et al.  Recent Advances in Hierarchical Reinforcement Learning , 2003, Discret. Event Dyn. Syst..

[429]  P. Glimcher Decisions, Uncertainty, and the Brain: The Science of Neuroeconomics , 2003 .

[430]  H. Jaeger Discrete-time, discrete-valued observable operator models: a tutorial , 2003 .

[431]  N. Daw,et al.  A computational substrate for incentive salience , 2003, Trends in Neurosciences.

[432]  Sham M. Kakade,et al.  On the sample complexity of reinforcement learning. , 2003 .

[433]  Dimitri P. Bertsekas,et al.  Least Squares Policy Evaluation Algorithms with Linear Function Approximation , 2003, Discret. Event Dyn. Syst..

[434]  Karl J. Friston,et al.  Temporal Difference Models and Reward-Related Learning in the Human Brain , 2003, Neuron.

[435]  W. Schultz,et al.  Discrete Coding of Reward Probability and Uncertainty by Dopamine Neurons , 2003, Science.

[436]  M. Thathachar,et al.  Networks of Learning Automata: Techniques for Online Stochastic Optimization , 2003 .

[437]  H. Seung Learning in Spiking Neural Networks by Reinforcement of Stochastic Synaptic Transmission , 2003, Neuron.

[438]  Michail G. Lagoudakis,et al.  Least-Squares Policy Iteration , 2003, J. Mach. Learn. Res..

[439]  Peter Norvig,et al.  Artificial intelligence - a modern approach, 2nd Edition , 2003, Prentice Hall series in artificial intelligence.

[440]  Eric Wiewiora,et al.  Potential-Based Shaping and Q-Value Initialization are Equivalent , 2003, J. Artif. Intell. Res..

[441]  R. Wise Dopamine, learning and motivation , 2004, Nature Reviews Neuroscience.

[442]  Peter Dayan,et al.  Q-learning , 1992, Machine Learning.

[443]  Xiaohui Xie,et al.  Learning in neural networks by reinforcement of irregular spiking. , 2004, Physical review. E, Statistical, nonlinear, and soft matter physics.

[444]  Thomas G. Dietterich,et al.  Explanation-Based Learning and Reinforcement Learning: A Unified View , 1997, Machine Learning.

[445]  Peter L. Bartlett,et al.  Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning , 2001, J. Mach. Learn. Res..

[446]  Nuttapong Chentanez,et al.  Intrinsically Motivated Reinforcement Learning , 2004, NIPS.

[447]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[448]  Peter Auer,et al.  Finite-time Analysis of the Multiarmed Bandit Problem , 2004, Machine Learning.

[449]  G. Peterson A day of great illumination: B. F. Skinner's discovery of shaping. , 2004, Journal of the experimental analysis of behavior.

[450]  Peter Dayan,et al.  The convergence of TD(λ) for general λ , 1992, Machine Learning.

[451]  Richard S. Sutton,et al.  Associative search network: A reinforcement learning associative memory , 1981, Biological Cybernetics.

[452]  A. Barto,et al.  Linear Least-Squares algorithms for temporal difference learning , 2004, Machine Learning.

[453]  Karl J. Friston,et al.  Dissociable Roles of Ventral and Dorsal Striatum in Instrumental Conditioning , 2004, Science.

[454]  Ronald J. Williams Simple statistical gradient-following algorithms for connectionist reinforcement learning , 2004, Machine Learning.

[455]  Scott Rixner,et al.  Memory Controller Optimizations for Web Servers , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[456]  José Luis Contreras-Vidal,et al.  A Predictive Reinforcement Model of Dopamine Neurons for Learning Approach Behavior , 1999, Journal of Computational Neuroscience.

[457]  John N. Tsitsiklis,et al.  Asynchronous Stochastic Approximation and Q-Learning , 1994, Machine Learning.

[458]  Gerald Tesauro Practical issues in temporal difference learning , 2004, Machine Learning.

[459]  Terrence J. Sejnowski,et al.  TD(λ) Converges with Probability 1 , 1994, Machine Learning.

[460]  Nancy Forbes Imitation of Life , 2004 .

[461]  Andrew W. Moore,et al.  Locally Weighted Learning , 1997, Artificial Intelligence Review.

[462]  Tommi S. Jaakkola,et al.  Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms , 2000, Machine Learning.

[463]  C. Glymour,et al.  A theory of causal learning in children: causal maps and Bayes nets. , 2004, Psychological review.

[464]  Nuttapong Chentanez,et al.  Intrinsically Motivated Learning of Hierarchical Collections of Skills , 2004 .

[465]  G. Tesauro,et al.  Simple neural models of classical conditioning , 1986, Biological Cybernetics.

[466]  C. Breazeal The Behavior System , 2004 .

[467]  Long Ji Lin,et al.  Self-improving reactive agents based on reinforcement learning, planning and teaching , 1992, Machine Learning.

[468]  Sridhar Mahadevan,et al.  Average reward reinforcement learning: Foundations, algorithms, and empirical results , 2004, Machine Learning.

[469]  Andrew W. Moore,et al.  The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces , 2004, Machine Learning.

[470]  R. Sutton,et al.  Synthesis of nonlinear control surfaces by a layered associative search network , 2004, Biological Cybernetics.

[471]  John N. Tsitsiklis,et al.  Feature-based methods for large scale dynamic programming , 2004, Machine Learning.

[472]  Justin A. Boyan,et al.  Technical Update: Least-Squares Temporal Difference Learning , 2002, Machine Learning.

[473]  Michael Kearns,et al.  Near-Optimal Reinforcement Learning in Polynomial Time , 1998, Machine Learning.

[474]  Richard S. Sutton,et al.  Landmark learning: An illustration of associative search , 1981, Biological Cybernetics.

[475]  A. Redish Addiction as a Computational Process Gone Awry , 2004, Science.

[476]  Jing Peng,et al.  Incremental multi-step Q-learning , 2004, Machine Learning.

[477]  Pieter Abbeel,et al.  Apprenticeship learning via inverse reinforcement learning , 2004, ICML.

[478]  Allen Newell,et al.  The problem of expensive chunks and its solution by restricting expressiveness , 1993, Machine Learning.

[479]  W. Pan,et al.  Dopamine Cells Respond to Predicted Events during Classical Conditioning: Evidence for Eligibility Traces in the Reward-Learning Network , 2005, The Journal of Neuroscience.

[480]  Nicol N. Schraudolph,et al.  Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation , 2005, NIPS.

[481]  B. Skinner OPERANT BEHAVIOR , 2005 .

[482]  Doina Precup,et al.  Off-policy Learning with Options and Recognizers , 2005, NIPS.

[483]  W. Schultz,et al.  Adaptive Coding of Reward Value by Dopamine Neurons , 2005, Science.

[484]  Dana H. Ballard,et al.  Learning to perceive and act by trial and error , 1991, Machine Learning.

[485]  Chrystopher L. Nehaniv,et al.  Empowerment: a universal agent-centric measure of control , 2005, 2005 IEEE Congress on Evolutionary Computation.

[486]  P. Dayan,et al.  Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control , 2005, Nature Neuroscience.

[487]  Richard S. Sutton,et al.  Learning to Predict by the Methods of Temporal Differences , 1988, Machine Learning.

[488]  Peter Dayan,et al.  How fast to work: Response vigor, motivation and tonic dopamine , 2005, NIPS.

[489]  Jongho Kim,et al.  An RLS-Based Natural Actor-Critic Algorithm for Locomotion of a Two-Linked Robot Arm , 2005, CIS.

[490]  Charles R. Gallistel,et al.  Deconstructing the law of effect , 2005, Games Econ. Behav..

[491]  Geoffrey J. Gordon,et al.  Fast Exact Planning in Markov Decision Processes , 2005, ICAPS.

[492]  Stefan Schaal,et al.  Natural Actor-Critic , 2003, ECML.

[493]  C. Padoa-Schioppa,et al.  Neurons in the orbitofrontal cortex encode economic value , 2006, Nature.

[494]  Warren B. Powell,et al.  Handbook of Learning and Approximate Dynamic Programming , 2006, IEEE Transactions on Automatic Control.

[495]  The short-latency dopamine signal: a role in discovering novel actions? , 2006, Nature Reviews Neuroscience.

[496]  Liming Xiang,et al.  Kernel-Based Reinforcement Learning , 2006, ICIC.

[497]  H. Yin,et al.  The role of the basal ganglia in habit formation , 2006, Nature Reviews Neuroscience.

[498]  Michael Thielscher,et al.  General Game Playing , 2014, Künstliche Intell..

[499]  Michael J. Frank,et al.  Making Working Memory Work: A Computational Model of Learning in the Prefrontal Cortex and Basal Ganglia , 2006, Neural Computation.

[500]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[501]  P. Dayan,et al.  A normative perspective on motivation , 2006, Trends in Cognitive Sciences.

[502]  Peter Dayan,et al.  The misbehavior of value and the discipline of the will , 2006, Neural Networks.

[503]  Rémi Coulom,et al.  Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search , 2006, Computers and Games.

[504]  David S. Touretzky,et al.  Representation and Timing in Theories of the Dopamine System , 2006, Neural Computation.

[505]  P. Dayan,et al.  Tonic dopamine: opportunity costs and the control of response vigor , 2007, Psychopharmacology.

[506]  Csaba Szepesvári,et al.  Bandit Based Monte-Carlo Planning , 2006, ECML.

[507]  Xin Xu,et al.  Kernel Least-Squares Temporal Difference Learning , 2006 .

[508]  Aaron C. Courville,et al.  Bayesian theories of conditioning in a changing world , 2006, Trends in Cognitive Sciences.

[509]  Prospect theory: an analysis of decision under risk , 2007 .

[510]  David Silver,et al.  Combining online and offline knowledge in UCT , 2007, ICML '07.

[511]  Razvan V. Florian,et al.  Reinforcement Learning Through Modulation of Spike-Timing-Dependent Synaptic Plasticity , 2007, Neural Computation.

[512]  Peter Redgrave,et al.  Basal ganglia , 2007, Scholarpedia.

[513]  Xi-Ren Cao,et al.  Stochastic Learning and Optimization - A Sensitivity-Based Approach , 2007 .

[514]  PVLV: the primary value and learned value Pavlovian learning algorithm. , 2007, Behavioral neuroscience.

[515]  Paolo Calabresi,et al.  Dopamine-mediated regulation of corticostriatal synaptic plasticity , 2007, Trends in Neurosciences.

[516]  Radford M. Neal Pattern Recognition and Machine Learning , 2007, Technometrics.

[517]  Robert A. Legenstein,et al.  Theoretical Analysis of Learning with Reward-Modulated Spike-Timing-Dependent Plasticity , 2007, NIPS.

[518]  R. Sutton On The Virtues of Linear Learning and Trajectory Distributions , 2007 .

[519]  M. Roesch,et al.  Dopamine neurons encode the better option in rats deciding between differently delayed or sized rewards , 2007, Nature Neuroscience.

[520]  J. O'Doherty,et al.  Determining the Neural Substrates of Goal-Directed Learning in the Human Brain , 2007, The Journal of Neuroscience.

[521]  E. Izhikevich Solving the distal reward problem through linkage of STDP and dopamine signaling , 2007, BMC Neuroscience.

[522]  Olle Gällmo,et al.  Reinforcement Learning by Construction of Hypothetical Targets , 2007 .

[523]  Adam Johnson,et al.  Neural Ensembles in CA3 Transiently Encode Paths Forward of the Animal at a Decision Point , 2007, The Journal of Neuroscience.

[524]  Pierre-Yves Oudeyer,et al.  Intrinsic Motivation Systems for Autonomous Mental Development , 2007, IEEE Transactions on Evolutionary Computation.

[525]  D. Hassabis,et al.  Deconstructing episodic memory with construction , 2007, Trends in Cognitive Sciences.

[526]  Yoshua Bengio,et al.  Learning Deep Architectures for AI , 2007, Found. Trends Mach. Learn..

[527]  Pierre-Yves Oudeyer,et al.  What is Intrinsic Motivation? A Typology of Computational Approaches , 2007, Frontiers Neurorobotics.

[528]  H. Seo,et al.  Dynamic signals related to choices and outcomes in the dorsolateral prefrontal cortex. , 2007, Cerebral cortex.

[529]  Ron Meir,et al.  Reinforcement Learning, Spike-Time-Dependent Plasticity, and the BCM Rule , 2007, Neural Computation.

[530]  M. Farries,et al.  Reinforcement learning with modulated spike timing dependent synaptic plasticity. , 2007, Journal of neurophysiology.

[531]  R. Malott,et al.  Principles of Behavior , 2007 .

[532]  T. L. Lai and Herbert Robbins  Asymptotically Efficient Adaptive Allocation Rules , 2022 .

[533]  J. Walrand,et al.  Distributed Dynamic Programming , 2022 .