A Theoretical Analysis of Deep Q-Learning

Despite the great empirical success of deep reinforcement learning, its theoretical foundations are less well understood. In this work, we make the first attempt to understand the deep Q-network (DQN) algorithm (Mnih et al., 2015) theoretically, from both algorithmic and statistical perspectives. Specifically, we focus on a slight simplification of DQN that fully captures its key features. Under mild assumptions, we establish algorithmic and statistical rates of convergence for the action-value functions of the iterative policy sequence obtained by DQN. In particular, the statistical error characterizes the bias and variance that arise from approximating the action-value function with a deep neural network, while the algorithmic error converges to zero at a geometric rate. As a byproduct, our analysis provides justification for the experience replay and target network techniques, both of which are crucial to the empirical success of DQN. Furthermore, as a simple extension of DQN, we propose the Minimax-DQN algorithm for two-player zero-sum Markov games. Adapting the analysis of DQN, we quantify the difference between the policies obtained by Minimax-DQN and the Nash equilibrium of the Markov game, again in terms of both algorithmic and statistical rates of convergence.
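
To make the analyzed procedure concrete, here is a minimal sketch of the kind of simplification the abstract refers to: fitted Q-iteration with experience replay and a periodically synchronized target network. The architecture, hyperparameters, and helper names (QNetwork, dqn_update, sync_target) are illustrative assumptions, not specifics taken from the paper.

```python
# A minimal sketch (not the paper's exact procedure) of simplified DQN:
# fitted Q-iteration with experience replay and a target network.
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class QNetwork(nn.Module):
    """ReLU network approximating the action-value function Q(s, .)."""

    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, s):
        return self.net(s)


def dqn_update(q_net, target_net, optimizer, replay, batch_size=32, gamma=0.99):
    """One regression step toward the target r + gamma * max_a' Q_target(s', a')."""
    batch = random.sample(replay, batch_size)          # experience replay
    s, a, r, s_next = zip(*batch)
    s, s_next = torch.stack(s), torch.stack(s_next)
    a, r = torch.tensor(a), torch.tensor(r)
    with torch.no_grad():                              # target network held fixed
        y = r + gamma * target_net(s_next).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for taken actions
    loss = F.mse_loss(q, y)                            # least-squares Bellman fit
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def sync_target(q_net, target_net):
    """Periodically copy the online weights into the target network."""
    target_net.load_state_dict(q_net.state_dict())
```

Freezing the target network between synchronizations turns each phase of training into a supervised least-squares regression problem, which is the viewpoint under which a bias-variance (statistical) analysis of the deep network approximation becomes natural.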
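
For the Minimax-DQN extension, the maximum over actions in the Bellman target is replaced by the value of a zero-sum matrix game at the next state, which can be computed with a small linear program. The sketch below assumes the target network outputs a payoff matrix over the two players' joint actions; matrix_game_value and minimax_target are hypothetical helper names, not from the paper.

```python
# A hedged sketch of the Minimax-DQN target: the max over actions is replaced
# by the Nash value of a zero-sum matrix game, solvable as a small LP.
import numpy as np
from scipy.optimize import linprog


def matrix_game_value(Q):
    """Value max_x min_y x^T Q y of the zero-sum game with payoff matrix Q."""
    m, n = Q.shape
    # Variables z = (x_1, ..., x_m, v); maximize v by minimizing -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # Feasibility: v <= (Q^T x)_j for every column j of Q.
    A_ub = np.hstack([-Q.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # x must be a probability distribution over the row player's actions.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.ones(1)
    bounds = [(0.0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:-1]  # game value, maximizing mixed strategy


def minimax_target(r, gamma, Q_next):
    """Bellman target r + gamma * (value of the stage game at the next state)."""
    v, _ = matrix_game_value(Q_next)
    return r + gamma * v
```

Here Q_next would be the target network's output at the next state, reshaped into an |A| x |B| matrix over the two players' joint actions; in the degenerate case where the opponent has a single action, minimax_target reduces to the standard DQN target above.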

References

[1] E. Rowland, Theory of Games and Economic Behavior, 1946, Nature.

[2] L. Shapley et al., Stochastic Games, 1953, Proceedings of the National Academy of Sciences.

[3] J. Friedman et al., Projection Pursuit Regression, 1981.

[4] C. J. Stone et al., Optimal Global Rates of Convergence for Nonparametric Regression, 1982.

[5] Michael L. Littman et al., Markov Games as a Framework for Multi-Agent Reinforcement Learning, 1994, ICML.

[6] Wolfgang Maass et al., Neural Nets with Superlinear VC-Dimension, 1994, Neural Computation.

[7] Leemon C. Baird et al., Residual Algorithms: Reinforcement Learning with Function Approximation, 1995, ICML.

[8] Dimitri P. Bertsekas et al., Stochastic shortest path games: theory and algorithms, 1997.

[9] Peter L. Bartlett et al., The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights is More Important than the Size of the Network, 1998, IEEE Trans. Inf. Theory.

[10] Peter L. Bartlett et al., Almost Linear VC-Dimension Bounds for Piecewise Polynomial Networks, 1998, Neural Computation.

[11] John N. Tsitsiklis et al., Actor-Critic Algorithms, 1999, NIPS.

[12] Peter L. Bartlett et al., Neural Network Learning: Theoretical Foundations, 1999.

[13] Yishay Mansour et al., Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.

[14] Manuela M. Veloso et al., Rational and Convergent Learning in Stochastic Games, 2001, IJCAI.

[15] Michail G. Lagoudakis et al., Value Function Approximation in Zero-Sum Markov Games, 2002, UAI.

[16] S. Murphy et al., Optimal dynamic treatment regimes, 2003.

[17] Michail G. Lagoudakis et al., Least-Squares Policy Iteration, 2003, J. Mach. Learn. Res.

[18] Peter Dayan et al., Q-learning, 1992, Machine Learning.

[19] Steven J. Bradtke et al., Linear least-squares algorithms for temporal difference learning, 1996, Machine Learning.

[20] Long Ji Lin et al., Self-improving reactive agents based on reinforcement learning, planning and teaching, 1992, Machine Learning.

[21] Justin A. Boyan et al., Technical Update: Least-Squares Temporal Difference Learning, 2002, Machine Learning.

[22] Martin A. Riedmiller, Neural Fitted Q Iteration: First Experiences with a Data Efficient Neural Reinforcement Learning Method, 2005, ECML.

[23] Pierre Geurts et al., Tree-Based Batch Mode Reinforcement Learning, 2005, J. Mach. Learn. Res.

[24] Richard S. Sutton et al., Reinforcement Learning: An Introduction, 1998, MIT Press.

[25] Susan A. Murphy, A Generalization Error for Q-Learning, 2005, J. Mach. Learn. Res.

[26] Vincent Conitzer et al., AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents, 2003, Machine Learning.

[27] Csaba Szepesvári et al., Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path, 2006, COLT.

[28] A. Antos et al., Value-Iteration Based Fitted Policy Iteration: Learning with a Single Trajectory, 2007, IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning.

[29] Csaba Szepesvári et al., Fitted Q-iteration in continuous action-space MDPs, 2007, NIPS.

[30] Benjamin Recht et al., Random Features for Large-Scale Kernel Machines, 2007, NIPS.

[31] Csaba Szepesvári et al., Finite-Time Bounds for Fitted Value Iteration, 2008, J. Mach. Learn. Res.

[32] Alex Smola et al., Kernel methods in machine learning, 2007, arXiv:math/0701907.

[33] Benjamin Recht et al., Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning, 2008, NIPS.

[34] Alexandre B. Tsybakov, Introduction to Nonparametric Estimation, 2008, Springer Series in Statistics.

[35] Shie Mannor et al., Regularized Fitted Q-Iteration for planning in continuous-space Markovian decision problems, 2009, American Control Conference.

[36] M. Kosorok et al., Reinforcement learning design for cancer clinical trials, 2009, Statistics in Medicine.

[37] Alessandro Lazaric et al., Analysis of a Classification-based Policy Iteration Algorithm, 2010, ICML.

[38] Geoffrey E. Hinton et al., Rectified Linear Units Improve Restricted Boltzmann Machines, 2010, ICML.

[39] Csaba Szepesvári et al., Error Propagation for Approximate Policy and Value Iteration, 2010, NIPS.

[40] S. Murphy et al., Performance guarantees for individualized treatment rules, 2011, Annals of Statistics.

[41] M. Kosorok et al., Reinforcement Learning Strategies for Clinical Trials in Nonsmall Cell Lung Cancer, 2011, Biometrics.

[42] Inbal Nahum-Shani et al., Q-learning: A data analysis method for constructing adaptive interventions, 2012, Psychological Methods.

[43] Roman Vershynin, Introduction to the non-asymptotic analysis of random matrices, 2010, Compressed Sensing.

[44] Donglin Zeng et al., Estimating Individualized Treatment Rules Using Outcome Weighted Learning, 2012, Journal of the American Statistical Association.

[45] Martin A. Riedmiller et al., Batch Reinforcement Learning, 2012, Reinforcement Learning.

[46] M. Kosorok et al., Q-learning with censored data, 2012, Annals of Statistics.

[47] Ameet Talwalkar et al., Foundations of Machine Learning, 2012, Adaptive Computation and Machine Learning.

[48] Eric B. Laber et al., A Robust Method for Estimating Optimal Treatment Regimes, 2012, Biometrics.

[49] Alessandro Lazaric et al., Finite-sample analysis of least-squares policy iteration, 2012, J. Mach. Learn. Res.

[50] Jan Peters et al., Reinforcement learning in robotics: A survey, 2013, Int. J. Robotics Res.

[51] Michael R. Kosorok et al., Adaptive Q-learning, 2013.

[52] B. Chakraborty et al., Statistical Methods for Dynamic Treatment Regimes: Reinforcement Learning, Causal Inference, and Personalized Medicine, 2013.

[53] Nitish Srivastava et al., Dropout: A simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.

[54] Eric B. Laber et al., Dynamic treatment regimes: Technical challenges and applications, 2014.

[55] Anastasios A. Tsiatis et al., Q- and A-learning Methods for Estimating Optimal Dynamic Treatment Regimes, 2012, Statistical Science.

[56] Bruno Scherrer et al., Rate of Convergence and Error Bounds for LSTD(λ), 2015, ICML.

[57] Ryota Tomioka et al., Norm-Based Capacity Control in Neural Networks, 2015, COLT.

[58] Shalabh Bhatnagar et al., Two-Timescale Algorithms for Learning Nash Equilibria in General-Sum Stochastic Games, 2015, AAMAS.

[59] Richard Evans et al., Deep Reinforcement Learning in Large Discrete Action Spaces, 2015, arXiv:1512.07679.

[60] Sergey Levine et al., Trust Region Policy Optimization, 2015, ICML.

[61] Ruslan Salakhutdinov et al., Path-SGD: Path-Normalized Optimization in Deep Neural Networks, 2015, NIPS.

[62] Matthieu Geist et al., Approximate modified policy iteration and its application to the game of Tetris, 2015, J. Mach. Learn. Res.

[63] Donglin Zeng et al., New Statistical Learning Methods for Estimating Optimal Dynamic Treatment Regimes, 2015, Journal of the American Statistical Association.

[64] Hassan Foroosh et al., Sparse Convolutional Neural Networks, 2015, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[65] Shane Legg et al., Human-level control through deep reinforcement learning, 2015, Nature.

[66] Bruno Scherrer et al., Approximate Dynamic Programming for Two-Player Zero-Sum Markov Games, 2015, ICML.

[67] Peter Stone et al., Deep Recurrent Q-Learning for Partially Observable MDPs, 2015, AAAI Fall Symposia.

[68] Marc G. Bellemare et al., The Arcade Learning Environment: An Evaluation Platform for General Agents, 2013, J. Artif. Intell. Res.

[69] M. R. Kosorok et al., Penalized Q-Learning for Dynamic Treatment Regimens, 2011, Statistica Sinica.

[70] Yuval Tassa et al., Continuous control with deep reinforcement learning, 2015, ICLR.

[71] Bruno Scherrer et al., On the Use of Non-Stationary Strategies for Solving Two-Player Zero-Sum Markov Games, 2016, AISTATS.

[72] Tom Schaul et al., Dueling Network Architectures for Deep Reinforcement Learning, 2015, ICML.

[73] Jason M. Klusowski et al., Risk Bounds for High-dimensional Ridge Function Combinations Including Neural Networks, 2016, arXiv:1607.01434.

[74] Song Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, 2015, ICLR.

[75] Alex Graves et al., Asynchronous Methods for Deep Reinforcement Learning, 2016, ICML.

[76] Demis Hassabis et al., Mastering the game of Go with deep neural networks and tree search, 2016, Nature.

[77] Matthieu Geist et al., Softened Approximate Policy Iteration for Markov Games, 2016, ICML.

[78] Tom Schaul et al., Prioritized Experience Replay, 2015, ICLR.

[79] Shie Mannor et al., Regularized Policy Iteration with Nonparametric Function Spaces, 2016, J. Mach. Learn. Res.

[80] Marc Peter Deisenroth et al., Deep Reinforcement Learning: A Brief Survey, 2017, IEEE Signal Processing Magazine.

[81] Johannes Schmidt-Hieber, Nonparametric regression using deep neural networks with ReLU activation function, 2017, The Annals of Statistics.

[82] Marcin Andrychowicz et al., Hindsight Experience Replay, 2017, NIPS.

[83] Leslie Pack Kaelbling et al., Generalization in Deep Learning, 2017, arXiv.

[84] Gintare Karolina Dziugaite et al., Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data, 2017, UAI.

[85] Shie Mannor et al., Shallow Updates for Deep Reinforcement Learning, 2017, NIPS.

[86] Sergey Levine et al., Reinforcement Learning with Deep Energy-Based Policies, 2017, ICML.

[87] Matus Telgarsky et al., Spectrally-normalized margin bounds for neural networks, 2017, NIPS.

[88] Francis R. Bach, On the Equivalence between Kernel Quadrature Rules and Random Feature Expansions, 2015, J. Mach. Learn. Res.

[89] Michael R. Kosorok et al., Residual Weighted Learning for Estimating Individualized Treatment Rules, 2015, Journal of the American Statistical Association.

[90] Richard S. Sutton et al., A Deeper Look at Experience Replay, 2017, arXiv.

[91] Jiliang Tang et al., A Survey on Dialogue Systems: Recent Advances and New Frontiers, 2017, ACM SIGKDD Explorations.

[92] Marc G. Bellemare et al., A Distributional Perspective on Reinforcement Learning, 2017, ICML.

[93] Demis Hassabis et al., Mastering the game of Go without human knowledge, 2017, Nature.

[94] Nathan Srebro et al., Exploring Generalization in Deep Learning, 2017, NIPS.

[95] Alec Radford et al., Proximal Policy Optimization Algorithms, 2017, arXiv.

[96] Marcello Restelli et al., Boosted Fitted Q-Iteration, 2017, ICML.

[97] Chen-Yu Wei et al., Online Reinforcement Learning in Stochastic Games, 2017, NIPS.

[98] Eric B. Laber et al., Interactive Q-Learning for Quantiles, 2017, Journal of the American Statistical Association.

[99] James Zou et al., The Effects of Memory Replay in Reinforcement Learning, 2018, Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[100] Ohad Shamir et al., Size-Independent Sample Complexity of Neural Networks, 2017, COLT.

[101] David A. McAllester et al., A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks, 2017, ICLR.

[102] Chengchun Shi et al., High-dimensional A-learning for optimal dynamic treatment regimes, 2018, Annals of Statistics.

[103] Michael H. Bowling et al., Actor-Critic Policy Optimization in Partially Observable Multiagent Environments, 2018, NeurIPS.

[104] Francis Bach et al., A Note on Lazy Training in Supervised Differentiable Programming, 2018, arXiv.

[105] Rémi Munos et al., Implicit Quantile Networks for Distributional Reinforcement Learning, 2018, ICML.

[106] Andrew R. Barron et al., Approximation and Estimation for High-Dimensional Deep Learning Networks, 2018, arXiv.

[107] Tamer Basar et al., Finite-Sample Analyses for Fully Decentralized Multi-Agent Reinforcement Learning, 2018, arXiv.

[108] Olivier Pietquin et al., Actor-Critic Fictitious Play in Simultaneous Move Multistage Games, 2018, AISTATS.

[109] Arthur Jacot et al., Neural Tangent Kernel: Convergence and Generalization in Neural Networks, 2018, NeurIPS.

[110] Marc G. Bellemare et al., Distributional Reinforcement Learning with Quantile Regression, 2017, AAAI.

[111] Liwei Wang et al., Gradient Descent Finds Global Minima of Deep Neural Networks, 2018, ICML.

[112] Petros Koumoutsakos et al., Remember and Forget for Experience Replay, 2018, ICML.

[113] Rui Song et al., Proper Inference for Value Function in High-Dimensional Q-Learning for Dynamic Treatment Regimes, 2018, Journal of the American Statistical Association.

[114] Ruosong Wang et al., Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks, 2019, ICML.

[115] Joelle Pineau et al., Benchmarking Batch Deep Reinforcement Learning Algorithms, 2019, arXiv.

[116] Michael Carbin et al., The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, 2018, ICLR.

[117] Jiming Liu et al., Reinforcement Learning in Healthcare: A Survey, 2019, ACM Comput. Surv.

[118] Yuan Cao et al., Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks, 2018, arXiv.

[119] Qi Cai et al., Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy, 2019, arXiv.

[120] Wojciech M. Czarnecki et al., Grandmaster level in StarCraft II using multi-agent reinforcement learning, 2019, Nature.

[121] Yuanzhi Li et al., A Convergence Theory for Deep Learning via Over-Parameterization, 2018, ICML.

[122] Julien Mairal et al., On the Inductive Bias of Neural Tangent Kernels, 2019, NeurIPS.

[123] M. Kohler et al., On deep learning as a remedy for the curse of dimensionality in nonparametric regression, 2019, The Annals of Statistics.

[124] Yuanzhi Li et al., Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers, 2018, NeurIPS.

[125] Francis Bach et al., On Lazy Training in Differentiable Programming, 2018, NeurIPS.

[126] Zhiyuan Xu et al., Learning the Dynamic Treatment Regimes from Medical Registry Data through Deep Q-network, 2019, Scientific Reports.

[127] Greg Yang et al., Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation, 2019, arXiv.

[128] Peter L. Bartlett et al., Nearly-tight VC-dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks, 2017, J. Mach. Learn. Res.

[129] Taiji Suzuki, Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality, 2018, ICLR.

[130] Yuan Cao et al., A Generalization Theory of Gradient Descent for Learning Over-parameterized Deep ReLU Networks, 2019, arXiv.

[131] Anastasios A. Tsiatis et al., Dynamic Treatment Regimes, 2019.

[132] Greg Yang et al., A Fine-Grained Spectral Perspective on Neural Networks, 2019, arXiv.

[133] Nan Jiang et al., Information-Theoretic Considerations in Batch Reinforcement Learning, 2019, ICML.

[134] Matthieu Geist et al., A Theory of Regularized Markov Decision Processes, 2019, ICML.

[135] Dale Schuurmans et al., Striving for Simplicity in Off-policy Deep Reinforcement Learning, 2019, arXiv.

[136] Barnabás Póczos et al., Gradient Descent Provably Optimizes Over-parameterized Neural Networks, 2018, ICLR.

[137] Yuan Cao et al., Algorithm-Dependent Generalization Bounds for Overparameterized Deep Residual Networks, 2019, NeurIPS.

[138] Tomaso A. Poggio et al., Fisher-Rao Metric, Geometry, and Complexity of Neural Networks, 2017, AISTATS.

[139] J. Lee et al., Neural Temporal-Difference Learning Converges to Global Optima, 2019, NeurIPS.

[140] Gilad Yehudai et al., On the Power and Limitations of Random Features for Understanding Neural Networks, 2019, NeurIPS.

[141] Andrea Montanari et al., Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit, 2019, COLT.

[142] Cho-Jui Hsieh et al., Convergence of Adversarial Training in Overparametrized Neural Networks, 2019, NeurIPS.

[143] Jason D. Lee et al., Beyond Linearization: On Quadratic and Higher-Order Approximation of Wide Neural Networks, 2019, ICLR.

[144] Tuo Zhao et al., Why Do Deep Residual Networks Generalize Better than Deep Feedforward Networks? A Neural Tangent Kernel Perspective, 2020, NeurIPS.

[145] Quanquan Gu et al., A Finite-Time Analysis of Q-Learning with Neural Network Function Approximation, 2019, ICML.

[146] Quanquan Gu et al., Generalization Error Bounds of Gradient Descent for Learning Over-Parameterized Deep ReLU Networks, 2019, AAAI.

[147] Rishabh Agarwal et al., An Optimistic Perspective on Offline Reinforcement Learning, 2019, ICML.

[148] Lei Wu et al., A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics, 2019, Science China Mathematics.

[149] Zhaoran Wang et al., Neural Policy Gradient Methods: Global Optimality and Rates of Convergence, 2019, ICLR.

[150] Cong Ma et al., A Selective Overview of Deep Learning, 2019, Statistical Science.

[151] Kaiqing Zhang et al., Finite-Sample Analysis for Decentralized Batch Multiagent Reinforcement Learning With Networked Agents, 2018, IEEE Transactions on Automatic Control.