High-Confidence Off-Policy Evaluation

Many reinforcement learning algorithms use trajectories collected from the execution of one or more policies to propose a new policy. Because executing a bad policy can be costly or dangerous, techniques for evaluating a new policy's performance without requiring its execution have been of recent interest in industry. Such off-policy evaluation methods, which estimate the performance of a policy using trajectories collected from the execution of other policies, have heretofore not provided confidence guarantees regarding the accuracy of their estimates. In this paper we propose an off-policy method for computing a lower confidence bound on the expected return of a policy.
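
Off-policy evaluation of this kind is typically built on importance sampling: each trajectory's return is reweighted by the likelihood ratio of the evaluation policy to the behavior policy, giving an unbiased estimate of the evaluation policy's expected return, and a lower confidence bound then follows from applying a concentration inequality to these per-trajectory estimates. The Python sketch below illustrates the idea under simplifying assumptions: finite trajectories, known behavior-policy action probabilities (pi_b), a hypothetical evaluation policy pi_e, and a known upper bound b on the importance-weighted returns. It uses the empirical Bernstein inequality of Maurer and Pontil (2009) as a stand-in concentration bound; it is not the specific bound derived in this paper.

```python
import numpy as np

def importance_sampled_returns(trajectories, pi_e, pi_b, gamma=1.0):
    """Per-trajectory importance-sampled return estimates.

    trajectories: list of trajectories, each a list of (state, action,
    reward) triples. pi_e(a, s) and pi_b(a, s) return the probability of
    action a in state s under the evaluation and behavior policies
    (assumed known). Each weighted return is an unbiased estimate of the
    evaluation policy's expected return.
    """
    estimates = []
    for trajectory in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(trajectory):
            weight *= pi_e(a, s) / pi_b(a, s)  # cumulative likelihood ratio
            ret += gamma ** t * r              # discounted return
        estimates.append(weight * ret)
    return np.asarray(estimates)

def empirical_bernstein_lower_bound(x, b, delta=0.05):
    """1 - delta lower confidence bound on E[X] for samples x in [0, b],
    via the empirical Bernstein inequality (Maurer & Pontil, 2009).
    """
    n = len(x)
    sample_var = x.var(ddof=1)  # unbiased sample variance
    return (x.mean()
            - np.sqrt(2.0 * sample_var * np.log(2.0 / delta) / n)
            - 7.0 * b * np.log(2.0 / delta) / (3.0 * (n - 1)))
```

Given n trajectories, empirical_bernstein_lower_bound(importance_sampled_returns(trajs, pi_e, pi_b), b) returns a value that lies below the evaluation policy's true expected return with probability at least 0.95. Note that importance weights can be heavy-tailed, so the bound b may be very large in practice; this looseness is exactly what motivates tighter, tail-aware concentration inequalities for off-policy evaluation.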
