Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping

Modern reinforcement learning tasks typically involve large state and action spaces. To handle them efficiently, one often uses a predefined feature mapping to represent states and actions in a low-dimensional space. In this paper, we study reinforcement learning with feature mapping for discounted Markov decision processes (MDPs). We propose a novel algorithm that exploits the feature mapping and attains a $\tilde O(d\sqrt{T}/(1-\gamma)^2)$ regret, where $d$ is the dimension of the feature space, $T$ is the time horizon, and $\gamma$ is the discount factor of the MDP. To the best of our knowledge, this is the first polynomial regret bound that does not require access to a generative model or strong assumptions such as ergodicity of the MDP. By constructing a special class of MDPs, we also show that for any algorithm the regret is lower bounded by $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$. Together, our upper and lower bounds suggest that the proposed algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
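
To make the near-optimality claim concrete, here is a minimal sketch of the comparison, assuming the standard notion of regret for discounted MDPs (the cumulative gap between the optimal value function and the value of the executed policy at the visited states); the notation below is illustrative rather than quoted from the paper.

\begin{align*}
\text{Regret}(T) &= \sum_{t=1}^{T} \bigl( V^{*}(s_t) - V^{\pi_t}(s_t) \bigr), \\
\text{upper bound:}\quad & \tilde O\!\bigl( d\sqrt{T}/(1-\gamma)^{2} \bigr), \qquad
\text{lower bound:}\quad \Omega\!\bigl( d\sqrt{T}/(1-\gamma)^{1.5} \bigr), \\
\text{gap (ignoring logarithmic factors):}\quad & \frac{d\sqrt{T}/(1-\gamma)^{2}}{d\sqrt{T}/(1-\gamma)^{1.5}} = (1-\gamma)^{-0.5}.
\end{align*}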
