Efficient Inference and Exploration for Reinforcement Learning

Despite an ever-growing literature on reinforcement learning algorithms and applications, much less is known about the statistical inference of their estimates. In this paper, we investigate the large-sample behavior of Q-value estimates and provide closed-form characterizations of their asymptotic variances. This allows us to efficiently construct confidence regions for the Q-value and optimal value functions, and to develop policies that minimize their estimation errors. It also leads to a policy exploration strategy that relies on estimating the relative discrepancies among the Q estimates. Numerical experiments show that our exploration strategy outperforms other benchmark approaches.
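As a rough illustration of the kind of inference the abstract describes, the sketch below forms a plug-in confidence interval for a single Q-value estimate under an assumed asymptotic normality result, and ranks actions by a variance-normalized gap to the empirical best as a stand-in for the "relative discrepancy" idea. The function names, the specific variance estimate `sigma_hat`, and the scoring rule are hypothetical illustrations, not the paper's actual constructions.

```python
import numpy as np
from scipy import stats

def q_confidence_interval(q_hat, sigma_hat, n, alpha=0.05):
    """Plug-in (1 - alpha) confidence interval for a Q-value estimate.

    Assumes an asymptotic normality result of the form
        sqrt(n) * (q_hat - Q)  ->  N(0, sigma^2),
    where sigma_hat consistently estimates the asymptotic standard
    deviation (the paper derives such variances in closed form; that
    formula is not reproduced here).
    """
    z = stats.norm.ppf(1 - alpha / 2)        # two-sided normal quantile
    margin = z * sigma_hat / np.sqrt(n)      # CLT-scaled half-width
    return q_hat - margin, q_hat + margin

def discrepancy_scores(q_hats, sigma_hats):
    """Variance-normalized gaps between each action and the empirical best.

    A small score marks an action that is hard to distinguish from the
    current best and is a natural candidate for further sampling. This
    only mimics the abstract's 'relative discrepancy' idea at a high
    level; the empirical best itself always scores zero.
    """
    q_hats = np.asarray(q_hats, dtype=float)
    sigma_hats = np.asarray(sigma_hats, dtype=float)
    gaps = q_hats.max() - q_hats                 # optimality gaps
    return gaps / np.maximum(sigma_hats, 1e-12)  # guard against zero variance

# Example usage with made-up estimates for three actions.
lo, hi = q_confidence_interval(q_hat=1.83, sigma_hat=2.4, n=10_000)
print(f"95% CI for Q(s, a): [{lo:.3f}, {hi:.3f}]")
print("scores:", discrepancy_scores([1.83, 1.79, 0.40], [2.4, 2.1, 1.5]))
```

In a sampling-allocation scheme of this flavor, simulation budget would be directed toward low-scoring actions, since they are the ones whose ordering relative to the best is statistically uncertain.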
