A Theoretical Analysis of Deep Q-Learning

Despite the great empirical success of deep reinforcement learning, its theoretical foundations are less well understood. In this work, we make the first attempt to understand the deep Q-network (DQN) algorithm (Mnih et al., 2015) theoretically, from both algorithmic and statistical perspectives. Specifically, we focus on a slight simplification of DQN that fully captures its key features. Under mild assumptions, we establish algorithmic and statistical rates of convergence for the action-value functions of the iterative policy sequence obtained by DQN. In particular, the statistical error characterizes the bias and variance that arise from approximating the action-value function with a deep neural network, while the algorithmic error converges to zero at a geometric rate. As a byproduct, our analysis provides justification for the experience replay and target network techniques, both of which are crucial to the empirical success of DQN. Furthermore, as a simple extension of DQN, we propose the Minimax-DQN algorithm for two-player zero-sum Markov games. Adapting the analysis of DQN, we quantify the difference between the policies obtained by Minimax-DQN and the Nash equilibrium of the Markov game, again in terms of both algorithmic and statistical rates of convergence.
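
To make the analyzed procedure concrete, here is a minimal sketch of the kind of simplification the abstract refers to: fitted Q-iteration with experience replay and a periodically synchronized target network. The architecture, hyperparameters, and helper names (QNetwork, dqn_update, sync_target) are illustrative assumptions, not specifics taken from the paper.

```python
# A minimal sketch (not the paper's exact procedure) of simplified DQN:
# fitted Q-iteration with experience replay and a target network.
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class QNetwork(nn.Module):
    """ReLU network approximating the action-value function Q(s, .)."""

    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, s):
        return self.net(s)


def dqn_update(q_net, target_net, optimizer, replay, batch_size=32, gamma=0.99):
    """One regression step toward the target r + gamma * max_a' Q_target(s', a')."""
    batch = random.sample(replay, batch_size)          # experience replay
    s, a, r, s_next = zip(*batch)
    s, s_next = torch.stack(s), torch.stack(s_next)
    a, r = torch.tensor(a), torch.tensor(r)
    with torch.no_grad():                              # target network held fixed
        y = r + gamma * target_net(s_next).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for taken actions
    loss = F.mse_loss(q, y)                            # least-squares Bellman fit
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def sync_target(q_net, target_net):
    """Periodically copy the online weights into the target network."""
    target_net.load_state_dict(q_net.state_dict())
```

Freezing the target network between synchronizations turns each phase of training into a supervised least-squares regression problem, which is the viewpoint under which a bias-variance (statistical) analysis of the deep network approximation becomes natural.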
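
For the Minimax-DQN extension, the maximum over actions in the Bellman target is replaced by the value of a zero-sum matrix game at the next state, which can be computed with a small linear program. The sketch below assumes the target network outputs a payoff matrix over the two players' joint actions; matrix_game_value and minimax_target are hypothetical helper names, not from the paper.

```python
# A hedged sketch of the Minimax-DQN target: the max over actions is replaced
# by the Nash value of a zero-sum matrix game, solvable as a small LP.
import numpy as np
from scipy.optimize import linprog


def matrix_game_value(Q):
    """Value max_x min_y x^T Q y of the zero-sum game with payoff matrix Q."""
    m, n = Q.shape
    # Variables z = (x_1, ..., x_m, v); maximize v by minimizing -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # Feasibility: v <= (Q^T x)_j for every column j of Q.
    A_ub = np.hstack([-Q.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # x must be a probability distribution over the row player's actions.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.ones(1)
    bounds = [(0.0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:-1]  # game value, maximizing mixed strategy


def minimax_target(r, gamma, Q_next):
    """Bellman target r + gamma * (value of the stage game at the next state)."""
    v, _ = matrix_game_value(Q_next)
    return r + gamma * v
```

Here Q_next would be the target network's output at the next state, reshaped into an |A| x |B| matrix over the two players' joint actions; in the degenerate case where the opponent has a single action, minimax_target reduces to the standard DQN target above.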

References

[1] E. Rowland, Theory of Games and Economic Behavior, 1946, Nature.

[2] L. Shapley et al., Stochastic Games, 1953, Proceedings of the National Academy of Sciences.

[3] J. Friedman et al., Projection Pursuit Regression, 1981.

[4] C. J. Stone et al., Optimal Global Rates of Convergence for Nonparametric Regression, 1982.

[5] Michael L. Littman et al., Markov Games as a Framework for Multi-Agent Reinforcement Learning, 1994, ICML.

[6] Wolfgang Maass et al., Neural Nets with Superlinear VC-Dimension, 1994, Neural Computation.

[7] Leemon C. Baird et al., Residual Algorithms: Reinforcement Learning with Function Approximation, 1995, ICML.

[8] Dimitri P. Bertsekas et al., Stochastic shortest path games: theory and algorithms, 1997.

[9] Peter L. Bartlett et al., The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights is More Important than the Size of the Network, 1998, IEEE Trans. Inf. Theory.

[10] Peter L. Bartlett et al., Almost Linear VC-Dimension Bounds for Piecewise Polynomial Networks, 1998, Neural Computation.

[11] John N. Tsitsiklis et al., Actor-Critic Algorithms, 1999, NIPS.

[12] Peter L. Bartlett et al., Neural Network Learning: Theoretical Foundations, 1999.

[13] Yishay Mansour et al., Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.

[14] Manuela M. Veloso et al., Rational and Convergent Learning in Stochastic Games, 2001, IJCAI.

[15] Michail G. Lagoudakis et al., Value Function Approximation in Zero-Sum Markov Games, 2002, UAI.

[16] S. Murphy et al., Optimal dynamic treatment regimes, 2003.

[17] Michail G. Lagoudakis et al., Least-Squares Policy Iteration, 2003, J. Mach. Learn. Res.

[18] Peter Dayan et al., Q-learning, 1992, Machine Learning.

[19] Steven J. Bradtke et al., Linear least-squares algorithms for temporal difference learning, 1996, Machine Learning.

[20] Long Ji Lin et al., Self-improving reactive agents based on reinforcement learning, planning and teaching, 1992, Machine Learning.

[21] Justin A. Boyan et al., Technical Update: Least-Squares Temporal Difference Learning, 2002, Machine Learning.

[22] Martin A. Riedmiller, Neural Fitted Q Iteration: First Experiences with a Data Efficient Neural Reinforcement Learning Method, 2005, ECML.

[23] Pierre Geurts et al., Tree-Based Batch Mode Reinforcement Learning, 2005, J. Mach. Learn. Res.

[24] Richard S. Sutton et al., Reinforcement Learning: An Introduction, 1998, MIT Press.

[25] Susan A. Murphy, A Generalization Error for Q-Learning, 2005, J. Mach. Learn. Res.

[26] Vincent Conitzer et al., AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents, 2003, Machine Learning.

[27] Csaba Szepesvári et al., Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path, 2006, COLT.

[28] A. Antos et al., Value-Iteration Based Fitted Policy Iteration: Learning with a Single Trajectory, 2007, IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning.

[29] Csaba Szepesvári et al., Fitted Q-iteration in continuous action-space MDPs, 2007, NIPS.

[30] Benjamin Recht et al., Random Features for Large-Scale Kernel Machines, 2007, NIPS.

[31] Csaba Szepesvári et al., Finite-Time Bounds for Fitted Value Iteration, 2008, J. Mach. Learn. Res.

[32] Alex Smola et al., Kernel methods in machine learning, 2007, arXiv:math/0701907.

[33] Benjamin Recht et al., Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning, 2008, NIPS.

[34] Alexandre B. Tsybakov, Introduction to Nonparametric Estimation, 2008, Springer Series in Statistics.

[35] Shie Mannor et al., Regularized Fitted Q-Iteration for planning in continuous-space Markovian decision problems, 2009, American Control Conference.

[36] M. Kosorok et al., Reinforcement learning design for cancer clinical trials, 2009, Statistics in Medicine.

[37] Alessandro Lazaric et al., Analysis of a Classification-based Policy Iteration Algorithm, 2010, ICML.

[38] Geoffrey E. Hinton et al., Rectified Linear Units Improve Restricted Boltzmann Machines, 2010, ICML.

[39] Csaba Szepesvári et al., Error Propagation for Approximate Policy and Value Iteration, 2010, NIPS.

[40] S. Murphy et al., Performance guarantees for individualized treatment rules, 2011, Annals of Statistics.

[41] M. Kosorok et al., Reinforcement Learning Strategies for Clinical Trials in Nonsmall Cell Lung Cancer, 2011, Biometrics.

[42] Inbal Nahum-Shani et al., Q-learning: A data analysis method for constructing adaptive interventions, 2012, Psychological Methods.

[43] Roman Vershynin, Introduction to the non-asymptotic analysis of random matrices, 2010, Compressed Sensing.

[44] Donglin Zeng et al., Estimating Individualized Treatment Rules Using Outcome Weighted Learning, 2012, Journal of the American Statistical Association.

[45] Martin A. Riedmiller et al., Batch Reinforcement Learning, 2012, Reinforcement Learning.

[46] M. Kosorok et al., Q-learning with censored data, 2012, Annals of Statistics.

[47] Ameet Talwalkar et al., Foundations of Machine Learning, 2012, Adaptive Computation and Machine Learning.

[48] Eric B. Laber et al., A Robust Method for Estimating Optimal Treatment Regimes, 2012, Biometrics.

[49] Alessandro Lazaric et al., Finite-sample analysis of least-squares policy iteration, 2012, J. Mach. Learn. Res.

[50] Jan Peters et al., Reinforcement learning in robotics: A survey, 2013, Int. J. Robotics Res.

[51] Michael R. Kosorok et al., Adaptive Q-learning, 2013.

[52] B. Chakraborty et al., Statistical Methods for Dynamic Treatment Regimes: Reinforcement Learning, Causal Inference, and Personalized Medicine, 2013.

[53] Nitish Srivastava et al., Dropout: A simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.

[54] Eric B. Laber et al., Dynamic treatment regimes: Technical challenges and applications, 2014.

[55] Anastasios A. Tsiatis et al., Q- and A-learning Methods for Estimating Optimal Dynamic Treatment Regimes, 2012, Statistical Science.

[56] Bruno Scherrer et al., Rate of Convergence and Error Bounds for LSTD(λ), 2015, ICML.

[57] Ryota Tomioka et al., Norm-Based Capacity Control in Neural Networks, 2015, COLT.

[58] Shalabh Bhatnagar et al., Two-Timescale Algorithms for Learning Nash Equilibria in General-Sum Stochastic Games, 2015, AAMAS.

[59] Richard Evans et al., Deep Reinforcement Learning in Large Discrete Action Spaces, 2015, arXiv:1512.07679.

[60] Sergey Levine et al., Trust Region Policy Optimization, 2015, ICML.

[61] Ruslan Salakhutdinov et al., Path-SGD: Path-Normalized Optimization in Deep Neural Networks, 2015, NIPS.

[62] Matthieu Geist et al., Approximate modified policy iteration and its application to the game of Tetris, 2015, J. Mach. Learn. Res.

[63] Donglin Zeng et al., New Statistical Learning Methods for Estimating Optimal Dynamic Treatment Regimes, 2015, Journal of the American Statistical Association.

[64] Hassan Foroosh et al., Sparse Convolutional Neural Networks, 2015, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[65] Shane Legg et al., Human-level control through deep reinforcement learning, 2015, Nature.

[66] Bruno Scherrer et al., Approximate Dynamic Programming for Two-Player Zero-Sum Markov Games, 2015, ICML.

[67] Peter Stone et al., Deep Recurrent Q-Learning for Partially Observable MDPs, 2015, AAAI Fall Symposia.

[68] Marc G. Bellemare et al., The Arcade Learning Environment: An Evaluation Platform for General Agents, 2013, J. Artif. Intell. Res.

[69] M. R. Kosorok et al., Penalized Q-Learning for Dynamic Treatment Regimens, 2011, Statistica Sinica.

[70] Yuval Tassa et al., Continuous control with deep reinforcement learning, 2015, ICLR.

[71] Bruno Scherrer et al., On the Use of Non-Stationary Strategies for Solving Two-Player Zero-Sum Markov Games, 2016, AISTATS.

[72] Tom Schaul et al., Dueling Network Architectures for Deep Reinforcement Learning, 2015, ICML.

[73] Jason M. Klusowski et al., Risk Bounds for High-dimensional Ridge Function Combinations Including Neural Networks, 2016, arXiv:1607.01434.

[74] Song Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, 2015, ICLR.

[75] Alex Graves et al., Asynchronous Methods for Deep Reinforcement Learning, 2016, ICML.

[76] Demis Hassabis et al., Mastering the game of Go with deep neural networks and tree search, 2016, Nature.

[77] Matthieu Geist et al., Softened Approximate Policy Iteration for Markov Games, 2016, ICML.

[78] Tom Schaul et al., Prioritized Experience Replay, 2015, ICLR.

[79] Shie Mannor et al., Regularized Policy Iteration with Nonparametric Function Spaces, 2016, J. Mach. Learn. Res.

[80] Marc Peter Deisenroth et al., Deep Reinforcement Learning: A Brief Survey, 2017, IEEE Signal Processing Magazine.

[81] Johannes Schmidt-Hieber, Nonparametric regression using deep neural networks with ReLU activation function, 2017, The Annals of Statistics.

[82] Marcin Andrychowicz et al., Hindsight Experience Replay, 2017, NIPS.

[83] Leslie Pack Kaelbling et al., Generalization in Deep Learning, 2017, arXiv.

[84] Gintare Karolina Dziugaite et al., Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data, 2017, UAI.

[85] Shie Mannor et al., Shallow Updates for Deep Reinforcement Learning, 2017, NIPS.

[86] Sergey Levine et al., Reinforcement Learning with Deep Energy-Based Policies, 2017, ICML.

[87] Matus Telgarsky et al., Spectrally-normalized margin bounds for neural networks, 2017, NIPS.

[88] Francis R. Bach, On the Equivalence between Kernel Quadrature Rules and Random Feature Expansions, 2015, J. Mach. Learn. Res.

[89] Michael R. Kosorok et al., Residual Weighted Learning for Estimating Individualized Treatment Rules, 2015, Journal of the American Statistical Association.

[90] Richard S. Sutton et al., A Deeper Look at Experience Replay, 2017, arXiv.

[91] Jiliang Tang et al., A Survey on Dialogue Systems: Recent Advances and New Frontiers, 2017, ACM SIGKDD Explorations.

[92] Marc G. Bellemare et al., A Distributional Perspective on Reinforcement Learning, 2017, ICML.

[93] Demis Hassabis et al., Mastering the game of Go without human knowledge, 2017, Nature.

[94] Nathan Srebro et al., Exploring Generalization in Deep Learning, 2017, NIPS.

[95] Alec Radford et al., Proximal Policy Optimization Algorithms, 2017, arXiv.

[96] Marcello Restelli et al., Boosted Fitted Q-Iteration, 2017, ICML.

[97] Chen-Yu Wei et al., Online Reinforcement Learning in Stochastic Games, 2017, NIPS.

[98] Eric B. Laber et al., Interactive Q-Learning for Quantiles, 2017, Journal of the American Statistical Association.

[99] James Zou et al., The Effects of Memory Replay in Reinforcement Learning, 2018, Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[100] Ohad Shamir et al., Size-Independent Sample Complexity of Neural Networks, 2017, COLT.

[101] David A. McAllester et al., A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks, 2017, ICLR.

[102] Chengchun Shi et al., High-dimensional A-learning for optimal dynamic treatment regimes, 2018, Annals of Statistics.

[103] Michael H. Bowling et al., Actor-Critic Policy Optimization in Partially Observable Multiagent Environments, 2018, NeurIPS.

[104] Francis Bach et al., A Note on Lazy Training in Supervised Differentiable Programming, 2018, arXiv.

[105] Rémi Munos et al., Implicit Quantile Networks for Distributional Reinforcement Learning, 2018, ICML.

[106] Andrew R. Barron et al., Approximation and Estimation for High-Dimensional Deep Learning Networks, 2018, arXiv.

[107] Tamer Basar et al., Finite-Sample Analyses for Fully Decentralized Multi-Agent Reinforcement Learning, 2018, arXiv.

[108] Olivier Pietquin et al., Actor-Critic Fictitious Play in Simultaneous Move Multistage Games, 2018, AISTATS.

[109] Arthur Jacot et al., Neural Tangent Kernel: Convergence and Generalization in Neural Networks, 2018, NeurIPS.

[110] Marc G. Bellemare et al., Distributional Reinforcement Learning with Quantile Regression, 2017, AAAI.

[111] Liwei Wang et al., Gradient Descent Finds Global Minima of Deep Neural Networks, 2018, ICML.

[112] Petros Koumoutsakos et al., Remember and Forget for Experience Replay, 2018, ICML.

[113] Rui Song et al., Proper Inference for Value Function in High-Dimensional Q-Learning for Dynamic Treatment Regimes, 2018, Journal of the American Statistical Association.

[114] Ruosong Wang et al., Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks, 2019, ICML.

[115] Joelle Pineau et al., Benchmarking Batch Deep Reinforcement Learning Algorithms, 2019, arXiv.

[116] Michael Carbin et al., The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, 2018, ICLR.

[117] Jiming Liu et al., Reinforcement Learning in Healthcare: A Survey, 2019, ACM Comput. Surv.

[118] Yuan Cao et al., Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks, 2018, arXiv.

[119] Qi Cai et al., Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy, 2019, arXiv.

[120] Wojciech M. Czarnecki et al., Grandmaster level in StarCraft II using multi-agent reinforcement learning, 2019, Nature.

[121] Yuanzhi Li et al., A Convergence Theory for Deep Learning via Over-Parameterization, 2018, ICML.

[122] Julien Mairal et al., On the Inductive Bias of Neural Tangent Kernels, 2019, NeurIPS.

[123] M. Kohler et al., On deep learning as a remedy for the curse of dimensionality in nonparametric regression, 2019, The Annals of Statistics.

[124] Yuanzhi Li et al., Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers, 2018, NeurIPS.

[125] Francis Bach et al., On Lazy Training in Differentiable Programming, 2018, NeurIPS.

[126] Zhiyuan Xu et al., Learning the Dynamic Treatment Regimes from Medical Registry Data through Deep Q-network, 2019, Scientific Reports.

[127] Greg Yang et al., Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation, 2019, arXiv.

[128] Peter L. Bartlett et al., Nearly-tight VC-dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks, 2017, J. Mach. Learn. Res.

[129] Taiji Suzuki, Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality, 2018, ICLR.

[130] Yuan Cao et al., A Generalization Theory of Gradient Descent for Learning Over-parameterized Deep ReLU Networks, 2019, arXiv.

[131] Anastasios A. Tsiatis et al., Dynamic Treatment Regimes, 2019.

[132] Greg Yang et al., A Fine-Grained Spectral Perspective on Neural Networks, 2019, arXiv.

[133] Nan Jiang et al., Information-Theoretic Considerations in Batch Reinforcement Learning, 2019, ICML.

[134] Matthieu Geist et al., A Theory of Regularized Markov Decision Processes, 2019, ICML.

[135] Dale Schuurmans et al., Striving for Simplicity in Off-policy Deep Reinforcement Learning, 2019, arXiv.

[136] Barnabás Póczos et al., Gradient Descent Provably Optimizes Over-parameterized Neural Networks, 2018, ICLR.

[137] Yuan Cao et al., Algorithm-Dependent Generalization Bounds for Overparameterized Deep Residual Networks, 2019, NeurIPS.

[138] Tomaso A. Poggio et al., Fisher-Rao Metric, Geometry, and Complexity of Neural Networks, 2017, AISTATS.

[139] J. Lee et al., Neural Temporal-Difference Learning Converges to Global Optima, 2019, NeurIPS.

[140] Gilad Yehudai et al., On the Power and Limitations of Random Features for Understanding Neural Networks, 2019, NeurIPS.

[141] Andrea Montanari et al., Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit, 2019, COLT.

[142] Cho-Jui Hsieh et al., Convergence of Adversarial Training in Overparametrized Neural Networks, 2019, NeurIPS.

[143] Jason D. Lee et al., Beyond Linearization: On Quadratic and Higher-Order Approximation of Wide Neural Networks, 2019, ICLR.

[144] Tuo Zhao et al., Why Do Deep Residual Networks Generalize Better than Deep Feedforward Networks? A Neural Tangent Kernel Perspective, 2020, NeurIPS.

[145] Quanquan Gu et al., A Finite-Time Analysis of Q-Learning with Neural Network Function Approximation, 2019, ICML.

[146] Quanquan Gu et al., Generalization Error Bounds of Gradient Descent for Learning Over-Parameterized Deep ReLU Networks, 2019, AAAI.

[147] Rishabh Agarwal et al., An Optimistic Perspective on Offline Reinforcement Learning, 2019, ICML.

[148] Lei Wu et al., A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics, 2019, Science China Mathematics.

[149] Zhaoran Wang et al., Neural Policy Gradient Methods: Global Optimality and Rates of Convergence, 2019, ICLR.

[150] Cong Ma et al., A Selective Overview of Deep Learning, 2019, Statistical Science.

[151] Kaiqing Zhang et al., Finite-Sample Analysis for Decentralized Batch Multiagent Reinforcement Learning With Networked Agents, 2018, IEEE Transactions on Automatic Control.