High-Confidence Off-Policy Evaluation

Many reinforcement learning algorithms use trajectories collected from the execution of one or more policies to propose a new policy. Because executing a bad policy can be costly or dangerous, techniques for evaluating a new policy's performance without requiring its execution have been of recent interest in industry. Such off-policy evaluation methods, which estimate the performance of a policy using trajectories collected from the execution of other policies, have heretofore not provided confidence guarantees regarding the accuracy of their estimates. In this paper we propose an off-policy method for computing a lower confidence bound on the expected return of a policy.
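
Off-policy evaluation of this kind is typically built on importance sampling: each trajectory's return is reweighted by the likelihood ratio of the evaluation policy to the behavior policy, giving an unbiased estimate of the evaluation policy's expected return, and a lower confidence bound then follows from applying a concentration inequality to these per-trajectory estimates. The Python sketch below illustrates the idea under simplifying assumptions: finite trajectories, known behavior-policy action probabilities (pi_b), a hypothetical evaluation policy pi_e, and a known upper bound b on the importance-weighted returns. It uses the empirical Bernstein inequality of Maurer and Pontil (2009) as a stand-in concentration bound; it is not the specific bound derived in this paper.

```python
import numpy as np

def importance_sampled_returns(trajectories, pi_e, pi_b, gamma=1.0):
    """Per-trajectory importance-sampled return estimates.

    trajectories: list of trajectories, each a list of (state, action,
    reward) triples. pi_e(a, s) and pi_b(a, s) return the probability of
    action a in state s under the evaluation and behavior policies
    (assumed known). Each weighted return is an unbiased estimate of the
    evaluation policy's expected return.
    """
    estimates = []
    for trajectory in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(trajectory):
            weight *= pi_e(a, s) / pi_b(a, s)  # cumulative likelihood ratio
            ret += gamma ** t * r              # discounted return
        estimates.append(weight * ret)
    return np.asarray(estimates)

def empirical_bernstein_lower_bound(x, b, delta=0.05):
    """1 - delta lower confidence bound on E[X] for samples x in [0, b],
    via the empirical Bernstein inequality (Maurer & Pontil, 2009).
    """
    n = len(x)
    sample_var = x.var(ddof=1)  # unbiased sample variance
    return (x.mean()
            - np.sqrt(2.0 * sample_var * np.log(2.0 / delta) / n)
            - 7.0 * b * np.log(2.0 / delta) / (3.0 * (n - 1)))
```

Given n trajectories, empirical_bernstein_lower_bound(importance_sampled_returns(trajs, pi_e, pi_b), b) returns a value that lies below the evaluation policy's true expected return with probability at least 0.95. Note that importance weights can be heavy-tailed, so the bound b may be very large in practice; this looseness is exactly what motivates tighter, tail-aware concentration inequalities for off-policy evaluation.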
