Paper Information - An Introduction to Deep Reinforcement Learning

An Introduction to Deep Reinforcement Learning

Deep reinforcement learning is the combination of reinforcement learning (RL) and deep learning. This field of research has been able to solve a wide range of complex decision-making tasks that were previously out of reach for a machine. Thus, deep RL opens up many new applications in domains such as healthcare, robotics, smart grids, finance, and many more. This manuscript provides an introduction to deep reinforcement learning models, algorithms and techniques. Particular focus is on the aspects related to generalization and how deep RL can be used for practical applications. We assume the reader is familiar with basic machine learning concepts.

[1] D. Whitteridge. Lectures on Conditioned Reflexes , 1942, Nature.

[2] Claude E. Shannon,et al. Programming a computer for playing chess , 1950 .

[3] R. Bellman. A Markovian Decision Process , 1957 .

[4] Arthur L. Samuel,et al. Some Studies in Machine Learning Using the Game of Checkers , 1967, IBM J. Res. Dev..

[5] Stuart E. Dreyfus,et al. Applied Dynamic Programming , 1965 .

[6] Walter Dandy,et al. The Brain , 1966 .

[7] R. Rescorla. A theory of pavlovian conditioning: The effectiveness of reinforcement and non-reinforcement , 1972 .

[8] D. Vere-Jones. Markov Chains , 1972, Nature.

[9] S. C. Jaquette. Markov Decision Processes with a New Optimality Criterion: Discrete Time , 1973 .

[10] Edward J. Sondik,et al. The Optimal Control of Partially Observable Markov Processes over the Infinite Horizon: Discounted Costs , 1978, Oper. Res..

[11] Kunihiko Fukushima,et al. Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Visual Pattern Recognition , 1982 .

[12] Richard S. Sutton,et al. Neuronlike adaptive elements that can solve difficult learning control problems , 1983, IEEE Transactions on Systems, Man, and Cybernetics.

[13] Richard S. Sutton,et al. Temporal credit assignment in reinforcement learning , 1984 .

[14] Geoffrey E. Hinton,et al. Learning representations by back-propagating errors , 1986, Nature.

[15] C. Watkins. Learning from delayed rewards , 1989 .

[16] Andrew W. Moore,et al. Efficient memory-based learning for robot control , 1990 .

[17] B. Widrow,et al. Neural networks for self-learning control systems , 1990, IEEE Control Systems Magazine.

[18] Elie Bienenstock,et al. Neural Networks and the Bias/Variance Dilemma , 1992, Neural Computation.

[19] Sebastian Thrun,et al. Efficient Exploration In Reinforcement Learning , 1992 .

[20] Bernd Brügmann. Monte Carlo Go , 1993 .

[21] Terrence J. Sejnowski,et al. Temporal Difference Learning of Position Evaluation in the Game of Go , 1993, NIPS.

[22] Andrew W. Moore,et al. Generalization in Reinforcement Learning: Safely Approximating the Value Function , 1994, NIPS.

[23] Michael L. Littman,et al. Markov Games as a Framework for Multi-Agent Reinforcement Learning , 1994, ICML.

[24] Michael I. Jordan,et al. Learning Without State-Estimation in Partially Observable Markovian Decision Processes , 1994, ICML.

[25] Deborah Silver,et al. Feature Visualization , 1994, Scientific Visualization.

[26] Gerald Tesauro,et al. Temporal difference learning and TD-Gammon , 1995, CACM.

[27] Dimitri P. Bertsekas,et al. Dynamic Programming and Optimal Control, Two Volume Set , 1995 .

[28] Richard S. Sutton,et al. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding , 1995, NIPS.

[29] Geoffrey J. Gordon. Stable Fitted Reinforcement Learning , 1995, NIPS.

[30] Inman Harvey,et al. Noise and the Reality Gap: The Use of Simulation in Evolutionary Robotics , 1995, ECAL.

[31] Leemon C. Baird,et al. Residual Algorithms: Reinforcement Learning with Function Approximation , 1995, ICML.

[32] Andrew McCallum,et al. Reinforcement learning with selective perception and hidden state , 1996 .

[33] John N. Tsitsiklis,et al. Analysis of Temporal-Difference Learning with Function Approximation , 1996, NIPS.

[34] Peter Dayan,et al. A Neural Substrate of Prediction and Reward , 1997, Science.

[35] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.

[36] Richard S. Sutton,et al. Roles of Macro-Actions in Accelerating Reinforcement Learning , 1998 .

[37] Milos Hauskrecht,et al. Hierarchical Solution of Markov Decision Processes using Macro-actions , 1998, UAI.

[38] Leslie Pack Kaelbling,et al. Planning and Acting in Partially Observable Stochastic Domains , 1998, Artif. Intell..

[39] Yoshua Bengio,et al. Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[40] Thomas G. Dietterich. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms , 1998, Neural Computation.

[41] Vladimir Vapnik,et al. Statistical learning theory , 1998 .

[42] Yoshua Bengio,et al. Convolutional networks for images, speech, and time series , 1998 .

[43] Shun-ichi Amari,et al. Natural Gradient Works Efficiently in Learning , 1998, Neural Computation.

[44] Stuart J. Russell,et al. Bayesian Q-Learning , 1998, AAAI/IAAI.

[45] Doina Precup,et al. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning , 1999, Artif. Intell..

[46] David Andre,et al. Model based Bayesian Exploration , 1999, UAI.

[47] Jay H. Lee,et al. Model predictive control: past, present and future , 1999 .

[48] Geoffrey J. Gordon,et al. Approximate solutions to markov decision processes , 1999 .

[49] John N. Tsitsiklis,et al. Actor-Critic Algorithms , 1999, NIPS.

[50] Andrew Y. Ng,et al. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping , 1999, ICML.

[51] Yishay Mansour,et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[52] Doina Precup,et al. Eligibility Traces for Off-Policy Policy Evaluation , 2000, ICML.

[53] Andrew Y. Ng,et al. Algorithms for Inverse Reinforcement Learning , 2000, ICML.

[54] Manuela M. Veloso,et al. Layered Learning , 2000, ECML.

[55] Peter L. Bartlett,et al. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results , 2003, J. Mach. Learn. Res..

[56] Sepp Hochreiter,et al. Learning to Learn Using Gradient Descent , 2001, ICANN.

[57] Sham M. Kakade,et al. A Natural Policy Gradient , 2001, NIPS.

[58] Murray Campbell,et al. Deep Blue , 2002, Artif. Intell..

[59] Clay B. Holroyd,et al. The neural basis of human error processing: reinforcement learning, dopamine, and the error-related negativity. , 2002, Psychological review.

[60] Ronen I. Brafman,et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning , 2001, J. Mach. Learn. Res..

[61] D. Braziunas. POMDP solution methods , 2003 .

[62] Joelle Pineau,et al. Point-based value iteration: An anytime algorithm for POMDPs , 2003, IJCAI.

[63] John Langford,et al. Exploration in Metric State Spaces , 2003, ICML.

[64] Remco R. Bouckaert,et al. Choosing Between Two Learning Algorithms Based on Calibrated Tests , 2003, ICML.

[65] Peter Dayan,et al. Q-learning , 1992, Machine Learning.

[66] Gareth James,et al. Variance and Bias for General Loss Functions , 2003, Machine Learning.

[67] Corinna Cortes,et al. Support-Vector Networks , 1995, Machine Learning.

[68] Richard S. Sutton,et al. Reinforcement learning with replacing eligibility traces , 2004, Machine Learning.

[69] Tommi S. Jaakkola,et al. Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms , 2000, Machine Learning.

[70] Eibe Frank,et al. Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms , 2004, PAKDD.

[71] Andrew W. Moore,et al. Variable Resolution Discretization in Optimal Control , 2002, Machine Learning.

[72] Anja Vogler,et al. An Introduction to Multivariate Statistical Analysis , 2004 .

[73] Colin Camerer,et al. Neuroeconomics: How Neuroscience Can Inform Economics , 2005 .

[74] R. J. Williams,et al. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[75] A. Barto,et al. An algebraic approach to abstraction in reinforcement learning , 2004 .

[76] Michael Kearns,et al. Near-Optimal Reinforcement Learning in Polynomial Time , 1998, Machine Learning.

[77] Longxin Lin. Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching , 2004, Machine Learning.

[78] Jing Peng,et al. Incremental multi-step Q-learning , 1994, Machine Learning.

[79] Pieter Abbeel,et al. Apprenticeship learning via inverse reinforcement learning , 2004, ICML.

[80] Martin A. Riedmiller. Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method , 2005, ECML.

[81] Pierre Geurts,et al. Tree-Based Batch Mode Reinforcement Learning , 2005, J. Mach. Learn. Res..

[82] Richard S. Sutton,et al. Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[83] Richard S. Sutton,et al. Learning to predict by the methods of temporal differences , 1988, Machine Learning.

[84] Kurt Driessens,et al. Relational Reinforcement Learning , 1998, Machine-mediated learning.

[85] Pierre Geurts,et al. Extremely randomized trees , 2006, Machine Learning.

[86] Olivier Teytaud,et al. Modification of UCT with Patterns in Monte-Carlo Go , 2006 .

[87] Janez Demsar,et al. Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[88] Radford M. Neal. Pattern Recognition and Machine Learning , 2007, Technometrics.

[89] Angela J. Yu,et al. Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration , 2007, Philosophical Transactions of the Royal Society B: Biological Sciences.

[90] Nasser M. Nasrabadi,et al. Pattern Recognition and Machine Learning , 2006, Technometrics.

[91] Andy Liaw,et al. Classification and Regression by randomForest , 2007 .

[92] Csaba Szepesvári,et al. Apprenticeship Learning using Inverse Reinforcement Learning and Gradient Methods , 2007, UAI.

[93] Michael L. Littman,et al. Efficient Reinforcement Learning with Relocatable Action Models , 2007, AAAI.

[94] Louis Wehenkel,et al. Variable selection for dynamic treatment regimes: a reinforcement learning approach , 2008 .

[95] P. Dayan,et al. Decision theory, reinforcement learning, and the brain , 2008, Cognitive, affective & behavioral neuroscience.

[96] P. Dayan,et al. Reinforcement learning: The Good, The Bad and The Ugly , 2008, Current Opinion in Neurobiology.

[97] Marek Petrik,et al. Biasing Approximate Dynamic Programming with a Lower Discount Factor , 2008, NIPS.

[98] Thomas G. Dietterich. Machine Learning and Ecosystem Informatics: Challenges and Opportunities , 2009, ACML.

[99] Andrew Y. Ng,et al. Near-Bayesian exploration in polynomial time , 2009, ICML '09.

[100] Shimon Whiteson,et al. Automatic Feature Selection for Model-Based Reinforcement Learning in Factored MDPs , 2009, 2009 International Conference on Machine Learning and Applications.

[101] Y. Niv. Reinforcement learning in the brain , 2009 .

[102] Pascal Vincent,et al. Visualizing Higher-Layer Features of a Deep Network , 2009 .

[103] Carl E. Rasmussen,et al. Gaussian processes for machine learning , 2005, Adaptive computation and machine learning.

[104] Brian Tanner,et al. RL-Glue: Language-Independent Software for Reinforcement-Learning Experiments , 2009, J. Mach. Learn. Res..

[105] Jason Weston,et al. Curriculum learning , 2009, ICML '09.

[106] Wouter Josemans. Generalization in Reinforcement Learning , 2009 .

[107] P. Montague,et al. Theoretical and Empirical Studies of Learning , 2009 .

[108] Monica Dinculescu,et al. Approximate Predictive Representations of Partially Observable Systems , 2010, ICML.

[109] Masashi Sugiyama,et al. Nonparametric Return Distribution Approximation for Reinforcement Learning , 2010, ICML.

[110] J. Andrew Bagnell,et al. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy , 2010 .

[111] Jürgen Schmidhuber,et al. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990–2010) , 2010, IEEE Transactions on Autonomous Mental Development.

[112] Hado van Hasselt,et al. Double Q-learning , 2010, NIPS.

[113] A. Casadevall,et al. Reproducible Science , 2010, Infection and Immunity.

[114] Gaël Varoquaux,et al. Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[115] Shimon Whiteson,et al. Protecting against evaluation overfitting in empirical reinforcement learning , 2011, 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL).

[116] Rémi Munos,et al. Pure exploration in finitely-armed and continuous-armed bandits , 2011, Theor. Comput. Sci..

[117] Yi Sun,et al. Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments , 2011, AGI.

[118] Martin A. Riedmiller,et al. Reinforcement learning in feedback control , 2011, Machine Learning.

[119] Carl E. Rasmussen,et al. PILCO: A Model-Based and Data-Efficient Approach to Policy Search , 2011, ICML.

[120] D. Kahneman. Thinking, Fast and Slow , 2011 .

[121] Jan Peters,et al. Relative Entropy Inverse Reinforcement Learning , 2011, AISTATS.

[122] Wei Chu,et al. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms , 2010, WSDM '11.

[123] Ian D. Watson,et al. Applying reinforcement learning to small scale combat in the real-time strategy game StarCraft:Broodwar , 2012, 2012 IEEE Conference on Computational Intelligence and Games (CIG).

[124] Kevin P. Murphy,et al. Machine learning - a probabilistic perspective , 2012, Adaptive computation and machine learning series.

[125] Regina Barzilay,et al. Learning High-Level Planning from Text , 2012, ACL.

[126] H. Seo,et al. Neural basis of reinforcement learning and decision making. , 2012, Annual review of neuroscience.

[127] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[128] Yuval Tassa,et al. MuJoCo: A physics engine for model-based control , 2012, 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[129] Simon M. Lucas,et al. A Survey of Monte Carlo Tree Search Methods , 2012, IEEE Transactions on Computational Intelligence and AI in Games.

[130] Bruno Castro da Silva,et al. Learning Parameterized Skills , 2012, ICML.

[131] Anton Nekrutenko,et al. Ten Simple Rules for Reproducible Computational Research , 2013, PLoS Comput. Biol..

[132] Louis Wehenkel,et al. Batch mode reinforcement learning based on the synthesis of artificial trajectories , 2013, Ann. Oper. Res..

[133] Sergey Levine,et al. Guided Policy Search , 2013, ICML.

[134] Stefan Schaal,et al. Learning objective functions for manipulation , 2013, 2013 IEEE International Conference on Robotics and Automation.

[135] P. Montague,et al. Reinforcement Learning Models Then-and-Now: From Single Cells to Modern Neuroimaging , 2013 .

[136] Pieter Abbeel,et al. Learning from Demonstrations Through the Use of Non-rigid Registration , 2013, ISRR.

[137] Qiang Yang,et al. Lifelong Machine Learning Systems: Beyond Learning Algorithms , 2013, AAAI Spring Symposium: Lifelong Machine Learning.

[138] S. Barry Cooper,et al. Digital Computers Applied to Games , 2013 .

[139] Kris K. Hauser,et al. Artificial intelligence framework for simulating clinical decision-making: A Markov decision process approach , 2013, Artif. Intell. Medicine.

[140] Nitish Srivastava,et al. Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..