High-Confidence Off-Policy Evaluation
Philip S. Thomas | Mohammad Ghavamzadeh | Georgios Theocharous
[1] J. Kiefer,et al. Asymptotic Minimax Character of the Sample Distribution Function and of the Classical Multinomial Estimator , 1956 .
[2] T. W. Anderson. Confidence Limits for the Expected Value of an Arbitrary Bounded Random Variable with a Continuous Distribution Function , 1969 .
[3] P. Massart. The Tight Constant in the Dvoretzky-Kiefer-Wolfowitz Inequality , 1990 .
[4] Richard S. Sutton,et al. Introduction to Reinforcement Learning , 1998 .
[5] Milos Hauskrecht,et al. Planning treatment of ischemic heart disease with partially observable Markov decision processes , 2000, Artif. Intell. Medicine.
[6] Doina Precup,et al. Eligibility Traces for Off-Policy Policy Evaluation , 2000, ICML.
[7] H. Keselman,et al. Modern robust data analysis methods: measures of central tendency. , 2003, Psychological methods.
[8] Mame Astou Diouf,et al. Improved Nonparametric Inference for the Mean of a Bounded Random Variable with Application to Poverty Measures , 2005 .
[9] Richard S. Sutton,et al. Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.
[10] P. Massart,et al. Concentration inequalities and model selection , 2007 .
[11] Massimiliano Pontil,et al. Empirical Bernstein Bounds and Sample-Variance Penalization , 2009, COLT.
[12] Philip S. Thomas,et al. Application of the Actor-Critic Architecture to Functional Electrical Stimulation Control of a Human Arm , 2009, IAAI.
[13] Larry D. Pyeatt,et al. Reinforcement Learning for Closed-Loop Propofol Anesthesia: A Human Volunteer Study , 2010, IAAI.
[14] Richard S. Sutton,et al. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces , 2010, Artificial General Intelligence.
[15] Wei Chu,et al. A contextual-bandit approach to personalized news article recommendation , 2010, WWW '10.
[16] R. Sutton,et al. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces , 2010 .
[17] Farbod Fahimi,et al. Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning , 2011, 2011 IEEE International Conference on Rehabilitation Robotics.
[18] Wei Chu,et al. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms , 2010, WSDM '11.
[19] Bo Liu,et al. Regularized Off-Policy TD-Learning , 2012, NIPS.
[20] David Silver,et al. Concurrent Reinforcement Learning from Customer Interactions , 2013, ICML.
[21] Sergey Levine,et al. Offline policy evaluation across representations with applications to educational games , 2014, AAMAS.